Compare commits

..

74 Commits

Author SHA1 Message Date
Georgi Gerganov
956bb14595 examples : remove --instruct remnants 2024-06-10 08:37:47 +03:00
Georgi Gerganov
10ceba354a flake.lock: Update (#7838)
Flake lock file updates:

• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/ad57eef4ef0659193044870c731987a6df5cf56b?narHash=sha256-SzDKxseEcHR5KzPXLwsemyTR/kaM9whxeiJohbL04rs%3D' (2024-05-29)
  → 'github:NixOS/nixpkgs/051f920625ab5aabe37c920346e3e69d7d34400e?narHash=sha256-4q0s6m0GUcN7q%2BY2DqD27iLvbcd1G50T2lv08kKxkSI%3D' (2024-06-07)

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2024-06-09 16:04:50 -07:00
Georgi Gerganov
e95beeb1fc imatrix : handle partial entries (#7833) 2024-06-09 20:19:35 +03:00
Nicolás Pérez
57bf62ce7c docs: Added initial PR template with directions for doc only changes and squash merges [no ci] (#7700)
This commit adds pull_request_template.md and CONTRIBUTING.md . It focuses on explaining to contributors the need to rate PR complexity level, when to add [no ci] and how to format PR title and descriptions.

Co-authored-by: Brian <mofosyne@gmail.com>
Co-authored-by: compilade <git@compilade.net>
2024-06-10 01:24:29 +10:00
mgroeber9110
3e2ee44315 server: do not remove whitespace at the start of a completion chunk (#7830) 2024-06-09 20:50:35 +10:00
Johannes Gäßler
42b53d192f CUDA: revise q8_1 data layout for mul_mat_q (#7824) 2024-06-09 09:42:25 +02:00
sasha0552
2decf57bc6 convert-hf : set the model name based on cli arg, if present (#7693)
`--model-name` argument was added a while ago but did not do anything.
This commit fixes this issue and enables this feature.
2024-06-09 16:39:25 +10:00
compilade
5795b94182 convert-hf : match model part name prefix and suffix (#7687)
In #7075, to fix the conversion of (some) models using model-00001-of-00001.safetensors instead of model.safetensors for a single model part we simply used the same logic as the part count to get the part names. 

But this doesn't always work correctly, like when unusual additional model files like consolidated.safetensors in https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3 are present.

This commit matching both the prefix and the suffix of the model part names should fix this problem without breaking any previously-supported upstream models. But according to report by @teleprint-me there is still some
persistent problem, but shall do in the meantime.
2024-06-09 12:47:25 +10:00
compilade
ed9f252118 gguf-py : decouple adding metadata from writing in GGUFWriter (#7827)
Main changes of this PR is to consolidate GGUFWriter.add_key and GGUFWriter.add_val into GGUFWriter.add_key_value. 

In addition use_temp_file is now opt-in instead of opt-out defaulting to False.

Also GGUFWriter now does not require output file name until when actually writing to it.

And GGUFWriter doesn't really need to eagerly prepare the data layout of the metadata
2024-06-09 12:34:29 +10:00
slaren
fe1e3917cf Revert "[SYCL] Update rpc-server.cpp to include SYCL backend (#7682)" (#7808)
This reverts commit 9422c5e34b.
2024-06-09 01:43:39 +02:00
Olivier Chafik
d4d915d351 url: save -mu downloads to new cache location (#7826)
* url: save -mu download to new cache location

* url: fs_get_cache_file_path util

* url: tweak sig of fs_get_cache_file
2024-06-08 21:21:08 +02:00
sasha0552
7a16ce7db2 server : smart slot selection using Longest Common Prefix (#7728)
* server : Smart selection of available slot using Longest Common Substring

* add usage

* remove trailing whitespaces

* Use Longest Common Prefix (LCP) instead of LCS

* Rename argument
2024-06-08 10:50:31 +03:00
slaren
da799b4189 vulkan : reuse parent extra for views (#7806)
* vulkan : reuse parent extra for views

* Fix validation error when multiple compute contexts are used in a graph

---------

Co-authored-by: 0cc4m <picard12@live.de>
2024-06-07 19:47:49 +02:00
Christian Zhou-Zheng
c00fad71e5 gguf-split : change binary multi-byte units to decimal (#7803) 2024-06-07 15:56:01 +03:00
intelmatt
27615f5ab2 cmake : fix BUILD_SHARED_LIBS=ON build (#7784)
common depends on pthreads in Linux
2024-06-07 15:15:07 +03:00
Johannes Gäßler
7027b27d76 server: update cache_prompt documentation [no ci] (#7745) 2024-06-07 11:15:49 +02:00
woodx
a5cabd7649 server : do not get prompt in infill mode (#7286)
* avoid to get prompt in infill mode and embedding mode

* remove embedding mode

* refactor format

---------

Co-authored-by: wudexiang <wudexiang@bytedance.com>
2024-06-07 10:09:45 +03:00
pengxin99
d5c938cd77 [SYCL] fix softmax r2r result wrong issue (#7811) 2024-06-07 14:28:26 +08:00
slaren
c9ee7118d5 check for nans in imatrix and quantize (#7807)
* imatrix : detect nan/inf values

* quantize : check imatrix for nan/inf values
2024-06-07 09:01:29 +03:00
Georgi Gerganov
ee459f40f6 server : fix --threads-http arg (#7801) 2024-06-06 19:19:59 +03:00
Georgi Gerganov
f83351f9a6 imatrix : migrate to gpt_params (#7771)
* imatrix : migrate to gpt_params

ggml-ci

* imatrix : add --save-frequency cli arg

* common : fix --no-ppl
2024-06-06 16:30:58 +03:00
Clint Herron
ad675e1c67 Added support for . (any character) token in grammar engine. (#6467)
* Added support for . (any characer) token in grammar engine.

* Add integration tests for any-character symbol.
2024-06-06 06:08:52 -07:00
Mattheus Chediak
a143c04375 README minor fixes (#7798) [no ci]
derievatives --> derivatives
2024-06-06 22:17:54 +10:00
Olivier Chafik
55b2d0849d grammars: x{min,max} repetition operator (#6640)
* grammars: x{min,max} repetition operator + tweak +/*/? to avoid duplication of original over alternates

* grammars: handle `x{n}` and fix `x{n,n}`

* grammars: document new repetition operators

* grammars: uniform use of int for min & max

* grammars: refactor parser test

* grammar: parsing tests w/ natural pretty print of updated expectations

* grammars: much prettier print of expectations (+ TEST_GRAMMAR_PARSER_PRINT_ALL=1 to force all)

* grammars: improve test pretty print again

* grammars: pretty print rules and chars

* grammars: fix copy rule skipping

* grammars: disallow `a{,}` (not allowed in regexps)

* Update common/grammar-parser.cpp

Co-authored-by: Clint Herron <hanclinto@gmail.com>

* grammars: fix copy rule skipping (again) & display of expectations

* grammars: more test cases

* grammars: update reps parsing to bring ? / * / + closer to before

* json: use new GBNF repetitions{m,n} syntax

* grammars: update performance gotchas w/ repetition advice

* Update examples/json_schema_to_grammar.py

Co-authored-by: Clint Herron <hanclinto@gmail.com>

* Update examples/server/public/json-schema-to-grammar.mjs

Co-authored-by: Clint Herron <hanclinto@gmail.com>

* grammars: comment on rule repetitions

* grammars: ensure unambiguous number alternatives

* grammar: nit typo switched error msgs

* grammar: nit numbering in comment

* json: update numeric rule to be unambiguous

* Apply suggestions from code review

Co-authored-by: Clint Herron <hanclinto@gmail.com>

* Update examples/server/public/json-schema-to-grammar.mjs

Co-authored-by: Clint Herron <hanclinto@gmail.com>

* json: fix integral-part

* grammar: add repetition tests

---------

Co-authored-by: Clint Herron <hanclinto@gmail.com>
2024-06-06 10:07:06 +01:00
Joan Fontanals
f5d7b268ec llama : add jina v2 base code (#7596)
* feat: add changes to handle jina v2 base code

* fix: do not complicate things

* fix: fix the usage of the code model

* fix: fix comments

* fix: fix linting issues

* fix: remove ollama patches

* style : minor

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-06-06 10:22:41 +03:00
slaren
2d08b7fbb4 docker : build only main and server in their images (#7782)
* add openmp lib to dockerfiles

* build only main and server in their docker images
2024-06-06 08:19:49 +03:00
slaren
d67caea0d6 docker : add openmp lib (#7780) 2024-06-06 08:17:21 +03:00
Galunid
7672adeec7 Fix encoding in python scripts (#7733) 2024-06-06 03:07:24 +10:00
Johannes Gäßler
7d1a378b8f CUDA: refactor mmq, dmmv, mmvq (#7716)
* CUDA: refactor mmq, dmmv, mmvq

* fix out-of-bounds write

* struct for qk, qr, qi

* fix cmake build

* mmq_type_traits
2024-06-05 16:53:00 +02:00
Georgi Gerganov
2b3389677a ggml : refactor rope norm/neox (#7634)
* ggml : unify rope norm/neox (CPU)

* ggml : fix compile warning

* ggml : remove GLM rope mode

ggml-ci

* metal : better rope implementation

ggml-ci

* cuda : better rope implementation

ggml-ci

* naming : n_orig_ctx -> n_ctx_orig

ggml-ci

* dev : add reminders to update backends

ggml-ci

* vulkan : fix ggml_rope_ext() usage

* cuda : fix array size + indents

ggml-ci
2024-06-05 11:29:20 +03:00
arch-btw
9973e81c5c readme : remove -ins (#7759)
-ins and --instruct were moved in https://github.com/ggerganov/llama.cpp/pull/7675

I have adjusted the README accordingly.
There was no trace of --chatml in the README.
2024-06-05 09:40:49 +03:00
jaime-m-p
c90dbe026b Fix per token atrributes bits (#7749) 2024-06-05 01:26:14 +02:00
agray3
b90dc566c1 Allow number of nodes in CUDA graph to change (#7738)
Previously the code would have failed to cope in the case that the
number of nodes changes in an existing CUDA graph. This fixes the
issue by removing an unnecessary conditional.
2024-06-04 22:06:49 +02:00
Georgi Gerganov
1442677f92 common : refactor cli arg parsing (#7675)
* common : gpt_params_parse do not print usage

* common : rework usage print (wip)

* common : valign

* common : rework print_usage

* infill : remove cfg support

* common : reorder args

* server : deduplicate parameters

ggml-ci

* common : add missing header

ggml-ci

* common : remote --random-prompt usages

ggml-ci

* examples : migrate to gpt_params

ggml-ci

* batched-bench : migrate to gpt_params

* retrieval : migrate to gpt_params

* common : change defaults for escape and n_ctx

* common : remove chatml and instruct params

ggml-ci

* common : passkey use gpt_params
2024-06-04 21:23:39 +03:00
Georgi Gerganov
554c247caf ggml : remove OpenCL (#7735)
ggml-ci
2024-06-04 21:23:20 +03:00
Georgi Gerganov
0cd6bd3483 llama : remove beam search (#7736) 2024-06-04 21:23:05 +03:00
Georgi Gerganov
5ca0944a15 readme : remove obsolete Zig instructions (#7471) 2024-06-04 19:43:01 +03:00
slaren
adc9ff3841 llama-bench : allow using a different printer for stderr with -oe (#7722)
compare-commits.sh : hide stdout, use -oe to print markdown
2024-06-04 14:32:42 +02:00
Daniele
987d743d6b Improve hipBLAS support in CMake (#7696)
* Improve hipBLAS support in CMake

This improves the detection of the correct CMAKE_PREFIX_PATH when using different distributions or a self-built ROCm SDK.

* Set ROCM_PATH correctly
2024-06-04 14:09:15 +02:00
zhouwg
b226c1227b refine .gitignore (#7688)
This adds tags and android ndk into the git ignore list
2024-06-04 21:21:26 +10:00
jaime-m-p
3b38d48609 Per token attributes (#7685)
* Add per token attributes enum
* Using phi-3 for testing 'rstrip'
* Using jina-v2 for testing 'lstrip'
* Brute force test for 'lstrip' and 'rstrip'
* Implement 'rstrip' and 'lstrip'
* Update phi-3 GGUF file (obsolete since 917dc8c)
* Replace llama_token_type with llama_token_attribs
2024-06-04 09:17:17 +02:00
Georgi Gerganov
6d1616944d ggml : prevent builds with -ffinite-math-only (#7726)
This enforces a check that -fno-finite-math-only was set and that the operating
compiling mode is not in finite maths mode. This is because during rewriting of
silu and softmax for cpu #7154 there emerged an issue where the result that was
observed when >1 slot was nondeterministic as found by @JohannesGaessler.

@LostRuins narrowed the problem down to -ffinite-math-only which was theorised
to be due to SiLU, instead of flushing small values to 0, returns NaN or some 
other garbage. @jart proposed a fix that @ggerganov then implemented in this fix

ref https://github.com/ggerganov/llama.cpp/pull/7154#issuecomment-2145661825
2024-06-04 17:01:09 +10:00
Radoslav Gerganov
bde7cd3cd9 llama : offload to RPC in addition to other backends (#7640)
* llama : offload to RPC in addition to other backends

* - fix copy_tensor being called on the src buffer instead of the dst buffer

- always initialize views in the view_src buffer

- add RPC backend to Makefile build

- add endpoint to all RPC object names

* add rpc-server to Makefile

* Update llama.cpp

Co-authored-by: slaren <slarengh@gmail.com>

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-06-03 20:03:26 +03:00
Masaya, Kato
a5735e4426 ggml : use OpenMP as a thread pool (#7606)
* ggml: Added OpenMP for multi-threads processing

* ggml : Limit the number of threads used to avoid deadlock

* update shared state n_threads in parallel region

* clear numa affinity for main thread even with openmp

* enable openmp by default

* fix msvc build

* disable openmp on macos

* ci : disable openmp with thread sanitizer

* Update ggml.c

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: slaren <slarengh@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-06-03 17:14:15 +02:00
Johannes Gäßler
0b832d53ba make: fix debug options not being applied to NVCC (#7714) 2024-06-03 16:28:58 +02:00
0cc4m
3d7ebf6312 Vulkan Mixture of Experts (MoE) support (#7628)
* Finish Vulkan mul_mat_id implementation

* Add Vulkan sum_rows and div ops

* Fix MUL_MAT_ID matrix matrix shader

* Fix MUL_MAT_ID matrix vector shader dispatch size

* Fix MUL_MAT_ID matrix vector shader and dispatch code

* Update Vulkan CPU offload for MUL_MAT_ID

* Fix crash when using split mode none and setting a main GPU
2024-06-03 10:59:14 +02:00
Andy Tai
a10cda58d3 cmake : add pkg-config spec file for llama.cpp (#7702) 2024-06-03 11:06:24 +03:00
zhangkaihuo
6f28a333c1 llama : MiniCPM support tied embeddings (#7664)
* support lm_head

* remove the code block

---------

Co-authored-by: zhangkaihuo <zhangkaihuo@modelbest.cn>
2024-06-03 10:49:30 +03:00
Georgi Gerganov
549279d804 llama : avoid double token-to-piece cache (#7654)
ggml-ci
2024-06-03 08:34:43 +03:00
woachk
9e405b6e2e kompute : implement op_getrows_f32 (#6403)
op_getrows_f32 is required since https://github.com/ggerganov/llama.cpp/pull/6122
for the Vulkan w/ Kompute backend to be functional.

As such, implement this op to make this backend functional again.
2024-06-03 08:32:16 +03:00
Dave Airlie
3413ae2193 fix bug introduced in using calloc (#7701)
compilade pointed this out on the previous MR
2024-06-02 17:59:54 -04:00
Georgi Gerganov
1669810d7c flake.lock: Update (#7686)
Flake lock file updates:

• Updated input 'flake-parts':
    'github:hercules-ci/flake-parts/8dc45382d5206bd292f9c2768b8058a8fd8311d9?narHash=sha256-/GJvTdTpuDjNn84j82cU6bXztE0MSkdnTWClUCRub78%3D' (2024-05-16)
  → 'github:hercules-ci/flake-parts/2a55567fcf15b1b1c7ed712a2c6fadaec7412ea8?narHash=sha256-iKzJcpdXih14qYVcZ9QC9XuZYnPc6T8YImb6dX166kw%3D' (2024-06-01)
• Updated input 'flake-parts/nixpkgs-lib':
    '50eb7ecf4c.tar.gz?narHash=sha256-QBx10%2Bk6JWz6u7VsohfSw8g8hjdBZEf8CFzXH1/1Z94%3D' (2024-05-02)
  → 'eb9ceca17d.tar.gz?narHash=sha256-lIbdfCsf8LMFloheeE6N31%2BBMIeixqyQWbSr2vk79EQ%3D' (2024-06-01)
• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/bfb7a882678e518398ce9a31a881538679f6f092?narHash=sha256-4zSIhSRRIoEBwjbPm3YiGtbd8HDWzFxJjw5DYSDy1n8%3D' (2024-05-24)
  → 'github:NixOS/nixpkgs/ad57eef4ef0659193044870c731987a6df5cf56b?narHash=sha256-SzDKxseEcHR5KzPXLwsemyTR/kaM9whxeiJohbL04rs%3D' (2024-05-29)

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2024-06-02 14:13:12 -07:00
Austin
7c4e5b7eae chore : add ignore rule for generated server themes (#7689) 2024-06-02 20:39:08 +03:00
nickp27
9422c5e34b [SYCL] Update rpc-server.cpp to include SYCL backend (#7682)
* Update rpc-server.cpp to include SYCL backend

Draft PR to address inclusion of SYCL backend for RPC server

* Update rpc-server.cpp
2024-06-02 12:13:54 +03:00
Johannes Gäßler
e141ce624a Fix FlashAttention debug test, FP32 assert (#7684) 2024-06-01 23:26:10 +02:00
Yazan Agha-Schrader
2e666832e6 server : new UI (#7633)
* ic

* migrate my eary work

* add the belonging stuff: css,favicon etc

* de prompts

* chore: Update HTML meta tags in index.html file

* add api-key css classes

* some necessary fixes

* Add API key CSS classes and update styling in style.css

* clean the code

* move API to the top, rearrange param sliders. update css

* add tooltips to the parameters with comprehensible explanations

* fix FloatField and BoolField tooltips

* fix grammar field width

* use template literales for promptFormats.js

* update const ModelGenerationInfo

* remove ms per token, since not relevant for most webui users and use cases

* add phi-3 prompt template

* add phi3 to dropdown

* add css class

* update forgotten css theme

* add user message suffix

* fix chatml & add llama3 format

* fix llama3 prompt template

* more prompt format fixes

* add more comon stop tokens

* add missing char

* do not separate with new line or comma

* move prompt style

* add hacky llama2 prompt solution, reduce redundancy in promptFormats.js

* fix toggle state localstorage

* add cmd-r prompt et reduce redundancy

* set default prompt to empty

* move files, clean code

* fix css path

* add a button to the new ui

* move new ui to "/public" due to otherwise problematic CORS behaviour

* include new ui in cpp

* fix wrong link to old ui

* renaming to ensure consistency

* fix typos "prompt-format" -> "prompt-formats"

* use correct indent

* add new ui files to makefile

* fix typo
2024-06-01 22:31:48 +03:00
HanishKVC
2ac95c9d56 SimpleChat: Simple histogram/repeatMatching driven garbageTrimming, Settings UI, Streaming mode, OpenAi Compat (Model, Authorization Bearer), Save/Restore session, Auto Settings UI (#7548)
* SimpleChat:DU:BringIn local helper js modules using importmap

Use it to bring in a simple trim garbage at end logic, which is
used to trim received response.

Also given that importmap assumes esm / standard js modules, so
also global variables arent implicitly available outside the
modules. So add it has a member of document for now

* SimpleChat:DU: Add trim garbage at end in loop helper

* SimpleChat:DU:TrimGarbage if unable try skip char and retry

* SimpleChat:DU: Try trim using histogram based info

TODO: May have to add max number of uniq chars in histogram at
end of learning phase.

* SimpleChat:DU: Switch trim garbage hist based to maxUniq simple

Instead of blindly building histogram for specified substring
length, and then checking if any new char within specified min
garbage length limit, NOW exit learn state when specified maxUniq
chars are found. Inturn there should be no new chars with in
the specified min garbage length required limit.

TODO: Need to track char classes like alphabets, numerals and
special/other chars.

* SimpleChat:DU: Bring in maxType to the mix along with maxUniq

Allow for more uniq chars, but then ensure that a given type of
char ie numerals or alphabets or other types dont cross the
specified maxType limit. This allows intermixed text garbage
to be identified and trimmed.

* SimpleChat:DU: Cleanup debug log messages

* SimpleChat:UI: Move html ui base helpers into its own module

* SimpleChat:DU:Avoid setting frequence/Presence penalty

Some models like llama3 found to try to be over intelligent by
repeating garbage still, but by tweaking the garbage a bit so that
it is not exactly same. So avoid setting these penalties and let
the model's default behaviour work out, as is.

Also the simple minded histogram based garbage trimming from end,
works to an extent, when the garbage is more predictable and
repeatative.

* SimpleChat:UI: Add and use a para-create-append helper

Also update the config params dump to indicate that now one needs
to use document to get hold of gMe global object, this is bcas of
moving to module type js.

Also add ui.mjs to importmap

* SimpleChat:UI: Helper to create bool button and use it wrt settings

* SimpleChat:UI: Add Select helper and use it wrt ChatHistoryInCtxt

* SimpleChat:UI:Select: dict-name-value, value wrt default, change

Take a dict/object of name-value pairs instead of just names.
Inturn specify the actual value wrt default, rather than the
string representing that value.

Trap the needed change event rather than click wrt select.

* SimpleChat:UI: Add Div wrapped label+element helpers

Move settings related elements to use the new div wrapped ones.

* SimpleChat:UI:Add settings button and bring in settings ui

* SimpleChat:UI:Settings make boolean button text show meaning

* SimpleChat: Update a bit wrt readme and notes in du

* SimpleChat: GarbageTrim enable/disable, show trimmed part ifany

* SimpleChat: highlight trim, garbage trimming bitmore aggressive

Make it easy for end user to identified the trimmed text.

Make garbage trimming logic, consider a longer repeat garbage
substring.

* SimpleChat: Cleanup a bit wrt Api end point related flow

Consolidate many of the Api end point related basic meta data into
ApiEP class.

Remove the hardcoded ApiEP/Mode settings from html+js, instead use
the generic select helper logic, inturn in the settings block.

Move helper to generate the appropriate request json string based
on ApiEP into SimpleChat class itself.

* SimpleChat:Move extracting assistant response to SimpleChat class

so also the trimming of garbage.

* SimpleChat:DU: Bring in both trim garbage logics to try trim

* SimpleChat: Cleanup readme a bit, add one more chathistory length

* SimpleChat:Stream:Initial handshake skeleton

Parse the got stream responses and try extract the data from it.

It allows for a part read to get a single data line or multiple
data line. Inturn extract the json body and inturn the delta
content/message in it.

* SimpleChat: Move handling oneshot mode server response

Move handling of the oneshot mode server response into SimpleChat.

Also add plumbing for moving multipart server response into same.

* SimpleChat: Move multi part server response handling in

* SimpleChat: Add MultiPart Response handling, common trimming

Add logic to call into multipart/stream server response handling.

Move trimming of garbage at the end into the common handle_response
helper.

Add new global flag to control between oneshot and multipart/stream
mode of fetching response. Allow same to be controlled by user.

If in multipart/stream mode, send the stream flag to the server.

* SimpleChat: show streamed generative text as it becomes available

Now that the extracting of streamed generated text is implemented,
add logic to show the same on the screen.

* SimpleChat:DU: Add NewLines helper class

To work with an array of new lines. Allow adding, appending,
shifting, ...

* SimpleChat:DU: Make NewLines shift more robust and flexible

* SimpleChat:HandleResponseMultiPart using NewLines helper

Make handle_response_multipart logic better and cleaner. Now it
allows for working with the situation, where the delta data line
got from server in stream mode, could be split up when recving,
but still the logic will handle it appropriately.

ALERT: Rather except (for now) for last data line wrt a request's
response.

* SimpleChat: Disable console debug by default by making it dummy

Parallely save a reference to the original func.

* SimpleChat:MultiPart/Stream flow cleanup

Dont try utf8-decode and newlines-add_append if no data to work on.

If there is no more data to get (ie done is set), then let NewLines
instance return line without newline at end, So that we dont miss
out on any last-data-line without newline kind of scenario.

Pass stream flag wrt utf-8 decode, so that if any multi-byte char
is only partly present in the passed buffer, it can be accounted
for along with subsequent buffer. At sametime, bcas of utf-8's
characteristics there shouldnt be any unaccounted bytes at end,
for valid block of utf8 data split across chunks, so not bothering
calling with stream set to false at end. LATER: Look at TextDecoder's
implementation, for any over intelligence, it may be doing..
If needed, one can use done flag to account wrt both cases.

* SimpleChat: Move baseUrl to Me and inturn gMe

This should allow easy updating of the base url at runtime by the
end user.

* SimpleChat:UI: Add input element helper

* SimpleChat: Add support for changing the base url

This ensures that if the user is running the server with a
different port or wants to try connect to server on a different
machine, then this can be used.

* SimpleChat: Move request headers into Me and gMe

Inturn allow Authorization to be sent, if not empty.

* SimpleChat: Rather need to use append to insert headers

* SimpleChat: Allow Authorization header to be set by end user

* SimpleChat:UI+: Return div and element wrt creatediv helpers

use it to set placeholder wrt Authorization header.

Also fix copy-paste oversight.

* SimpleChat: readme wrt authorization, maybe minimal openai testing

* SimpleChat: model request field for openai/equivalent compat

May help testing with openai/equivalent web services, if they
require this field.

* SimpleChat: readme stream-utf-8 trim-english deps, exception2error

* Readme: Add a entry for simplechat in the http server section

* SimpleChat:WIP:Collate internally, Stream mode Trap exceptions

This can help ensure that data fetched till that point, can be
made use of, rather than losing it.

On some platforms, the time taken wrt generating a long response,
may lead to the network connection being broken when it enters
some user-no-interaction related power saving mode.

* SimpleChat:theResp-origMsg: Undo a prev change to fix non trim

When the response handling was moved into SimpleChat, I had changed
a flow bit unnecessarily and carelessly, which resulted in the non
trim flow, missing out on retaining the ai assistant response.

This has been fixed now.

* SimpleChat: Save message internally in handle_response itself

This ensures that throwing the caught exception again for higher
up logic, doesnt lose the response collated till that time.

Go through theResp.assistant in catch block, just to keep simple
consistency wrt backtracing just in case.

Update the readme file.

* SimpleChat:Cleanup: Add spacing wrt shown req-options

* SimpleChat:UI: CreateDiv Divs map to GridX2 class

This allows the settings ui to be cleaner structured.

* SimpleChat: Show Non SettingsUI config field by default

* SimpleChat: Allow for multiline system prompt

Convert SystemPrompt into a textarea with 2 rows. Reduce
user-input-textarea to 2 rows from 3, so that overall
vertical space usage remains same.

Shorten usage messages a bit, cleanup to sync with settings ui.

* SimpleChat: Add basic skeleton for saving and loading chat

Inturn when ever a chat message (system/user/model) is added,
the chat will be saved into browser's localStorage.

* SimpleChat:ODS: Add a prefix to chatid wrt ondiskstorage key

* SimpleChat:ODS:WIP:TMP: Add UI to load previously saved chat

This is a temporary flow

* SimpleChat:ODS:Move restore/load saved chat btn setup to Me

This also allows being able to set the common system prompt
ui element to loaded chat's system prompt.

* SimpleChat:Readme updated wrt save and restore chat session info

* SimpleChat:Show chat session restore button, only if saved session

* SimpleChat: AutoCreate ChatRequestOptions settings to an extent

* SimpleChat: Update main README wrt usage with server
2024-06-02 02:20:18 +10:00
Johannes Gäßler
750f60c03e CUDA: fix Pascal FA, deq. KV to FP16 for batch > 8 (#7681) 2024-06-01 15:47:04 +02:00
Johannes Gäßler
9b596417af CUDA: quantized KV support for FA vec (#7527)
* CUDA: quantized KV support for FA vec

* try CI fix

* fix commented-out kernel variants

* add q8_0 q4_0 tests

* fix nwarps > batch size

* split fattn compile via extern templates

* fix flake8

* fix metal tests

* fix cmake

* make generate_cu_files.py executable

* add autogenerated .cu files

* fix AMD

* error if type_v != FP16 and not flash_attn

* remove obsolete code
2024-06-01 08:44:14 +02:00
Georgi Gerganov
a323ec60af server : update js (#7670) 2024-05-31 22:23:04 +03:00
Galunid
0515ad93f4 convert-hf : Handle NotImplementedError in convert-hf-to-gguf (#7660) 2024-05-31 17:42:33 +02:00
Johannes Gäßler
c8047d538f scripts: update compare_llama_bench.py [no ci] (#7673) 2024-05-31 16:26:21 +02:00
Daniele
30e238b246 Improve HIP compatibility (#7672) 2024-05-31 16:00:29 +02:00
Georgi Gerganov
16926dff92 readme : link homebrew discussion 2024-05-31 15:04:58 +03:00
Georgi Gerganov
0c27e6f62e ggml : fix loongson compile warnings (#7537)
* ggml : fix loongson compile warnings

ggml-ci

* Fix loongarch quantize test fail.

Fix unexpected error introduced during rebase code.

* tests : disable json test due to lack of python on the CI node

ggml-ci

---------

Co-authored-by: junchao-loongson <zhaojunchao@loongson.cn>
2024-05-31 14:17:10 +03:00
Galunid
2e32f874e6 Somehow '**' got lost (#7663) 2024-05-31 18:24:41 +10:00
Galunid
1af511fc22 Add convert.py removal to hot topics (#7662) 2024-05-31 10:09:20 +02:00
Sertaç Özercan
0541f06296 [no ci] docs: add aikit to readme (#7650)
Signed-off-by: Sertac Ozercan <sozercan@gmail.com>
2024-05-31 09:57:16 +10:00
JohnnyB
9022c33646 Fixed painfully slow single process builds. (#7326)
* Fixed painfully slow single process builds.

* Added nproc for systems that don't default to nproc
2024-05-30 22:32:38 +02:00
Georgi Gerganov
5921b8f089 llama : cache llama_token_to_piece (#7587)
* llama : cache llama_token_to_piece

ggml-ci

* llama : use vectors and avoid has_cache

ggml-ci

* llama : throw on unknown tokenizer types

ggml-ci

* llama : print a log of the total cache size
2024-05-31 02:01:41 +10:00
Martin Delille
5dcdf94676 Fix conan badge display [no ci] (#7645) 2024-05-31 01:07:39 +10:00
Manuel
2e2340de17 Add brew installation instruction to README [no ci] (#7616) 2024-05-31 00:58:15 +10:00
Martin Delille
7846540bd2 readme : add Conan badge (#7638) 2024-05-30 15:52:50 +03:00
Brian
e6157f94c8 github: add contact links to issues and convert question into research [no ci] (#7612) 2024-05-30 21:55:36 +10:00
267 changed files with 87174 additions and 23652 deletions

View File

@@ -12,7 +12,7 @@ FROM ${BASE_CUDA_DEV_CONTAINER} as build
ARG CUDA_DOCKER_ARCH=all
RUN apt-get update && \
apt-get install -y build-essential python3 python3-pip git libcurl4-openssl-dev
apt-get install -y build-essential python3 python3-pip git libcurl4-openssl-dev libgomp1
COPY requirements.txt requirements.txt
COPY requirements requirements
@@ -31,6 +31,6 @@ ENV LLAMA_CUDA=1
# Enable cURL
ENV LLAMA_CURL=1
RUN make
RUN make -j$(nproc)
ENTRYPOINT ["/app/.devops/tools.sh"]

View File

@@ -45,6 +45,6 @@ ENV LLAMA_CURL=1
RUN apt-get update && \
apt-get install -y libcurl4-openssl-dev
RUN make
RUN make -j$(nproc)
ENTRYPOINT ["/app/.devops/tools.sh"]

View File

@@ -3,7 +3,7 @@ ARG UBUNTU_VERSION=22.04
FROM ubuntu:$UBUNTU_VERSION as build
RUN apt-get update && \
apt-get install -y build-essential python3 python3-pip git libcurl4-openssl-dev
apt-get install -y build-essential python3 python3-pip git libcurl4-openssl-dev libgomp1
COPY requirements.txt requirements.txt
COPY requirements requirements
@@ -18,7 +18,7 @@ COPY . .
ENV LLAMA_CURL=1
RUN make
RUN make -j$(nproc)
ENV LC_ALL=C.utf8

View File

@@ -23,10 +23,13 @@ ENV CUDA_DOCKER_ARCH=${CUDA_DOCKER_ARCH}
# Enable CUDA
ENV LLAMA_CUDA=1
RUN make
RUN make -j$(nproc) main
FROM ${BASE_CUDA_RUN_CONTAINER} as runtime
RUN apt-get update && \
apt-get install -y libgomp1
COPY --from=build /app/main /main
ENTRYPOINT [ "/main" ]

View File

@@ -40,6 +40,6 @@ ENV LLAMA_HIPBLAS=1
ENV CC=/opt/rocm/llvm/bin/clang
ENV CXX=/opt/rocm/llvm/bin/clang++
RUN make
RUN make -j$(nproc) main
ENTRYPOINT [ "/app/main" ]

View File

@@ -3,7 +3,7 @@ ARG UBUNTU_VERSION=jammy
FROM ubuntu:$UBUNTU_VERSION as build
# Install build tools
RUN apt update && apt install -y git build-essential cmake wget
RUN apt update && apt install -y git build-essential cmake wget libgomp1
# Install Vulkan SDK
RUN wget -qO - https://packages.lunarg.com/lunarg-signing-key-pub.asc | apt-key add - && \

View File

@@ -9,10 +9,13 @@ WORKDIR /app
COPY . .
RUN make
RUN make -j$(nproc) main
FROM ubuntu:$UBUNTU_VERSION as runtime
RUN apt-get update && \
apt-get install -y libgomp1
COPY --from=build /app/main /main
ENV LC_ALL=C.utf8

View File

@@ -25,12 +25,12 @@ ENV LLAMA_CUDA=1
# Enable cURL
ENV LLAMA_CURL=1
RUN make
RUN make -j$(nproc) server
FROM ${BASE_CUDA_RUN_CONTAINER} as runtime
RUN apt-get update && \
apt-get install -y libcurl4-openssl-dev
apt-get install -y libcurl4-openssl-dev libgomp1
COPY --from=build /app/server /server

View File

@@ -45,6 +45,6 @@ ENV LLAMA_CURL=1
RUN apt-get update && \
apt-get install -y libcurl4-openssl-dev
RUN make
RUN make -j$(nproc)
ENTRYPOINT [ "/app/server" ]

View File

@@ -11,12 +11,12 @@ COPY . .
ENV LLAMA_CURL=1
RUN make
RUN make -j$(nproc) server
FROM ubuntu:$UBUNTU_VERSION as runtime
RUN apt-get update && \
apt-get install -y libcurl4-openssl-dev
apt-get install -y libcurl4-openssl-dev libgomp1
COPY --from=build /app/server /server

View File

@@ -1,38 +0,0 @@
name: Question
description: Used to ask questions about llama.cpp
title: "Question: "
labels: ["question"]
body:
- type: markdown
attributes:
value: |
[Please search your question first in Discussion if you got a common general question.](https://github.com/ggerganov/llama.cpp/discussions/categories/q-a)
- type: checkboxes
id: prerequisites
attributes:
label: Prerequisites
description: Please confirm the following before submitting your question.
options:
- label: I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
required: true
- label: I reviewed the [Discussions](https://github.com/ggerganov/llama.cpp/discussions), and have a new useful question to share that cannot be answered within Discussions.
required: true
- type: textarea
id: background-description
attributes:
label: Background Description
description: Please provide a detailed written description of what you were trying to do, and what you expected `llama.cpp` to do as an question.
placeholder: Detailed description of your question
validations:
required: true
- type: textarea
id: possible-answer
attributes:
label: Possible Answer
description: If you have some idea of possible answers you want to confirm, that would also be appreciated.
placeholder: Your idea of possible answers
validations:
required: false

52
.github/ISSUE_TEMPLATE/06-research.yml vendored Normal file
View File

@@ -0,0 +1,52 @@
name: Research
description: Track new technical research area
title: "Research: "
labels: ["research 🔬"]
body:
- type: markdown
attributes:
value: |
Don't forget to check for any [duplicate research issue tickets](https://github.com/ggerganov/llama.cpp/issues?q=is%3Aopen+is%3Aissue+label%3A%22research+%F0%9F%94%AC%22)
- type: checkboxes
id: research-stage
attributes:
label: Research Stage
description: Track general state of this research ticket
options:
- label: Background Research (Let's try to avoid reinventing the wheel)
- label: Hypothesis Formed (How do you think this will work and it's effect?)
- label: Strategy / Implementation Forming
- label: Analysis of results
- label: Debrief / Documentation (So people in the future can learn from us)
- type: textarea
id: background
attributes:
label: Previous existing literature and research
description: Whats the current state of the art and whats the motivation for this research?
- type: textarea
id: hypothesis
attributes:
label: Hypothesis
description: How do you think this will work and it's effect?
- type: textarea
id: implementation
attributes:
label: Implementation
description: Got an approach? e.g. a PR ready to go?
- type: textarea
id: analysis
attributes:
label: Analysis
description: How does the proposed implementation behave?
- type: textarea
id: logs
attributes:
label: Relevant log output
description: Please copy and paste any relevant log output. This will be automatically formatted into code, so no need for backticks.
render: shell

13
.github/ISSUE_TEMPLATE/config.yml vendored Normal file
View File

@@ -0,0 +1,13 @@
blank_issues_enabled: true
contact_links:
- name: Got an idea?
url: https://github.com/ggerganov/llama.cpp/discussions/categories/ideas
about: Pop it there. It may then become an enhancement ticket.
- name: Got a question?
url: https://github.com/ggerganov/llama.cpp/discussions/categories/q-a
about: Ask a question there!
- name: Want to contribute?
url: https://github.com/ggerganov/llama.cpp/wiki/contribute
about: Head to the contribution guide page of the wiki for areas you can help with

View File

@@ -0,0 +1,5 @@
- Self Reported Review Complexity:
- [ ] Review Complexity : Low
- [ ] Review Complexity : Medium
- [ ] Review Complexity : High
- [ ] I have read the [contributing guidelines](CONTRIBUTING.md)

View File

@@ -294,12 +294,22 @@ jobs:
- name: Build
id: cmake_build
if: ${{ matrix.sanitizer != 'THREAD' }}
run: |
mkdir build
cd build
cmake .. -DLLAMA_FATAL_WARNINGS=ON -DLLAMA_SANITIZE_${{ matrix.sanitizer }}=ON -DCMAKE_BUILD_TYPE=${{ matrix.build_type }}
cmake --build . --config ${{ matrix.build_type }} -j $(nproc)
- name: Build (no OpenMP)
id: cmake_build_no_openmp
if: ${{ matrix.sanitizer == 'THREAD' }}
run: |
mkdir build
cd build
cmake .. -DLLAMA_FATAL_WARNINGS=ON -DLLAMA_SANITIZE_${{ matrix.sanitizer }}=ON -DCMAKE_BUILD_TYPE=${{ matrix.build_type }} -DLLAMA_OPENMP=OFF
cmake --build . --config ${{ matrix.build_type }} -j $(nproc)
- name: Test
id: cmake_test
run: |
@@ -678,8 +688,6 @@ jobs:
env:
OPENBLAS_VERSION: 0.3.23
OPENCL_VERSION: 2023.04.17
CLBLAST_VERSION: 1.6.0
SDE_VERSION: 9.33.0-2024-01-07
VULKAN_VERSION: 1.3.261.1
@@ -696,8 +704,6 @@ jobs:
defines: '-DLLAMA_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DLLAMA_AVX2=OFF -DBUILD_SHARED_LIBS=ON'
- build: 'avx512-x64'
defines: '-DLLAMA_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DLLAMA_AVX512=ON -DBUILD_SHARED_LIBS=ON'
- build: 'clblast-x64'
defines: '-DLLAMA_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DLLAMA_CLBLAST=ON -DBUILD_SHARED_LIBS=ON -DCMAKE_PREFIX_PATH="$env:RUNNER_TEMP/clblast"'
- build: 'openblas-x64'
defines: '-DLLAMA_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DLLAMA_BLAS=ON -DBUILD_SHARED_LIBS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS -DBLAS_INCLUDE_DIRS="$env:RUNNER_TEMP/openblas/include" -DBLAS_LIBRARIES="$env:RUNNER_TEMP/openblas/lib/openblas.lib"'
- build: 'kompute-x64'
@@ -722,27 +728,6 @@ jobs:
run: |
git submodule update --init kompute
- name: Download OpenCL SDK
id: get_opencl
if: ${{ matrix.build == 'clblast-x64' }}
run: |
curl.exe -o $env:RUNNER_TEMP/opencl.zip -L "https://github.com/KhronosGroup/OpenCL-SDK/releases/download/v${env:OPENCL_VERSION}/OpenCL-SDK-v${env:OPENCL_VERSION}-Win-x64.zip"
mkdir $env:RUNNER_TEMP/opencl
tar.exe -xvf $env:RUNNER_TEMP/opencl.zip --strip-components=1 -C $env:RUNNER_TEMP/opencl
- name: Download CLBlast
id: get_clblast
if: ${{ matrix.build == 'clblast-x64' }}
run: |
curl.exe -o $env:RUNNER_TEMP/clblast.7z -L "https://github.com/CNugteren/CLBlast/releases/download/${env:CLBLAST_VERSION}/CLBlast-${env:CLBLAST_VERSION}-windows-x64.7z"
curl.exe -o $env:RUNNER_TEMP/CLBlast.LICENSE.txt -L "https://github.com/CNugteren/CLBlast/raw/${env:CLBLAST_VERSION}/LICENSE"
7z x "-o${env:RUNNER_TEMP}" $env:RUNNER_TEMP/clblast.7z
rename-item $env:RUNNER_TEMP/CLBlast-${env:CLBLAST_VERSION}-windows-x64 clblast
foreach ($f in (gci -Recurse -Path "$env:RUNNER_TEMP/clblast" -Filter '*.cmake')) {
$txt = Get-Content -Path $f -Raw
$txt.Replace('C:/vcpkg/packages/opencl_x64-windows/', "$($env:RUNNER_TEMP.Replace('\','/'))/opencl/") | Set-Content -Path $f -Encoding UTF8
}
- name: Download OpenBLAS
id: get_openblas
if: ${{ matrix.build == 'openblas-x64' }}
@@ -776,13 +761,6 @@ jobs:
cmake -S . -B build ${{ matrix.defines }}
cmake --build build --config Release -j ${env:NUMBER_OF_PROCESSORS}
- name: Add clblast.dll
id: add_clblast_dll
if: ${{ matrix.build == 'clblast-x64' }}
run: |
cp $env:RUNNER_TEMP/clblast/lib/clblast.dll ./build/bin/Release
cp $env:RUNNER_TEMP/CLBlast.LICENSE.txt ./build/bin/Release/CLBlast-${env:CLBLAST_VERSION}.txt
- name: Add libopenblas.dll
id: add_libopenblas_dll
if: ${{ matrix.build == 'openblas-x64' }}
@@ -806,7 +784,7 @@ jobs:
- name: Test
id: cmake_test
# not all machines have native AVX-512
if: ${{ matrix.build != 'msvc-arm64' && matrix.build != 'llvm-arm64' && matrix.build != 'clblast-x64' && matrix.build != 'kompute-x64' && matrix.build != 'vulkan-x64' && (matrix.build != 'avx512-x64' || env.HAS_AVX512F == '1') }}
if: ${{ matrix.build != 'msvc-arm64' && matrix.build != 'llvm-arm64' && matrix.build != 'kompute-x64' && matrix.build != 'vulkan-x64' && (matrix.build != 'avx512-x64' || env.HAS_AVX512F == '1') }}
run: |
cd build
ctest -L main -C Release --verbose --timeout 900
@@ -1061,7 +1039,7 @@ jobs:
# hypervisor: 'qemu'
# run: |
# sudo pkg update
# sudo pkg install -y gmake automake autoconf pkgconf llvm15 clinfo clover opencl clblast openblas
# sudo pkg install -y gmake automake autoconf pkgconf llvm15 openblas
# gmake CC=/usr/local/bin/clang15 CXX=/usr/local/bin/clang++15 -j `sysctl -n hw.ncpu`
release:

3
.gitignore vendored
View File

@@ -34,9 +34,11 @@ ggml-metal-embed.metal
lcov-report/
gcovr-report/
tags
build*
!build.zig
cmake-build-*
android-ndk-*
out/
tmp/
@@ -105,6 +107,7 @@ examples/jeopardy/results.txt
examples/server/*.html.hpp
examples/server/*.js.hpp
examples/server/*.mjs.hpp
examples/server/*.css.hpp
poetry.lock
poetry.toml

View File

@@ -106,11 +106,11 @@ set(LLAMA_CUDA_PEER_MAX_BATCH_SIZE "128" CACHE STRING
"llama: max. batch size for using peer access")
option(LLAMA_CUDA_NO_PEER_COPY "llama: do not use peer to peer copies" OFF)
option(LLAMA_CUDA_NO_VMM "llama: do not try to use CUDA VMM" OFF)
option(LLAMA_CUDA_FA_ALL_QUANTS "llama: compile all quants for FlashAttention" OFF)
option(LLAMA_CURL "llama: use libcurl to download model from an URL" OFF)
option(LLAMA_HIPBLAS "llama: use hipBLAS" OFF)
option(LLAMA_HIP_UMA "llama: use HIP unified memory architecture" OFF)
option(LLAMA_CLBLAST "llama: use CLBlast" OFF)
option(LLAMA_VULKAN "llama: use Vulkan" OFF)
option(LLAMA_VULKAN_CHECK_RESULTS "llama: run Vulkan op checks" OFF)
option(LLAMA_VULKAN_DEBUG "llama: enable Vulkan debug output" OFF)
@@ -125,6 +125,7 @@ set(LLAMA_METAL_MACOSX_VERSION_MIN "" CACHE STRING
set(LLAMA_METAL_STD "" CACHE STRING "llama: metal standard version (-std flag)")
option(LLAMA_KOMPUTE "llama: use Kompute" OFF)
option(LLAMA_RPC "llama: use RPC" OFF)
option(LLAMA_OPENMP "llama: use OpenMP" ON)
option(LLAMA_SYCL "llama: use SYCL" OFF)
option(LLAMA_SYCL_F16 "llama: use 16 bit floats for sycl calculations" OFF)
set(LLAMA_SYCL_TARGET "INTEL" CACHE STRING "llama: sycl target device")
@@ -295,6 +296,17 @@ if (LLAMA_METAL)
)
endif()
if (LLAMA_OPENMP)
find_package(OpenMP)
if (OpenMP_FOUND)
message(STATUS "OpenMP found")
add_compile_definitions(GGML_USE_OPENMP)
set(LLAMA_EXTRA_LIBS ${LLAMA_EXTRA_LIBS} OpenMP::OpenMP_C OpenMP::OpenMP_CXX)
else()
message(WARNING "OpenMP not found")
endif()
endif()
if (LLAMA_BLAS)
if (LLAMA_STATIC)
set(BLA_STATIC ON)
@@ -402,6 +414,10 @@ if (LLAMA_CUDA)
file(GLOB GGML_SOURCES_CUDA "ggml-cuda/*.cu")
list(APPEND GGML_SOURCES_CUDA "ggml-cuda.cu")
file(GLOB SRCS "ggml-cuda/template-instances/fattn-wmma*.cu")
list(APPEND GGML_SOURCES_CUDA ${SRCS})
file(GLOB SRCS "ggml-cuda/template-instances/mmq*.cu")
list(APPEND GGML_SOURCES_CUDA ${SRCS})
add_compile_definitions(GGML_USE_CUDA)
add_compile_definitions(GGML_CUDA_USE_GRAPHS)
@@ -427,6 +443,18 @@ if (LLAMA_CUDA)
if (LLAMA_CUDA_NO_PEER_COPY)
add_compile_definitions(GGML_CUDA_NO_PEER_COPY)
endif()
if (LLAMA_CUDA_FA_ALL_QUANTS)
file(GLOB SRCS "ggml-cuda/template-instances/fattn-vec*.cu")
list(APPEND GGML_SOURCES_CUDA ${SRCS})
add_compile_definitions(GGML_CUDA_FA_ALL_QUANTS)
else()
file(GLOB SRCS "ggml-cuda/template-instances/fattn-vec*q4_0-q4_0.cu")
list(APPEND GGML_SOURCES_CUDA ${SRCS})
file(GLOB SRCS "ggml-cuda/template-instances/fattn-vec*q8_0-q8_0.cu")
list(APPEND GGML_SOURCES_CUDA ${SRCS})
file(GLOB SRCS "ggml-cuda/template-instances/fattn-vec*f16-f16.cu")
list(APPEND GGML_SOURCES_CUDA ${SRCS})
endif()
if (LLAMA_STATIC)
if (WIN32)
@@ -475,22 +503,6 @@ if (LLAMA_RPC)
set(GGML_SOURCES_RPC ggml-rpc.cpp)
endif()
if (LLAMA_CLBLAST)
find_package(CLBlast)
if (CLBlast_FOUND)
message(STATUS "CLBlast found")
set(GGML_HEADERS_OPENCL ggml-opencl.h)
set(GGML_SOURCES_OPENCL ggml-opencl.cpp)
add_compile_definitions(GGML_USE_CLBLAST)
set(LLAMA_EXTRA_LIBS ${LLAMA_EXTRA_LIBS} clblast)
else()
message(WARNING "CLBlast not found")
endif()
endif()
if (LLAMA_VULKAN)
find_package(Vulkan)
if (Vulkan_FOUND)
@@ -530,12 +542,17 @@ if (LLAMA_VULKAN)
endif()
if (LLAMA_HIPBLAS)
if ($ENV{ROCM_PATH})
set(ROCM_PATH $ENV{ROCM_PATH})
if (NOT EXISTS $ENV{ROCM_PATH})
if (NOT EXISTS /opt/rocm)
set(ROCM_PATH /usr)
else()
set(ROCM_PATH /opt/rocm)
endif()
else()
set(ROCM_PATH /opt/rocm)
set(ROCM_PATH $ENV{ROCM_PATH})
endif()
list(APPEND CMAKE_PREFIX_PATH ${ROCM_PATH})
list(APPEND CMAKE_PREFIX_PATH "${ROCM_PATH}/lib64/cmake")
# CMake on Windows doesn't support the HIP language yet
if(WIN32)
@@ -571,6 +588,10 @@ if (LLAMA_HIPBLAS)
file(GLOB GGML_SOURCES_ROCM "ggml-cuda/*.cu")
list(APPEND GGML_SOURCES_ROCM "ggml-cuda.cu")
file(GLOB SRCS "ggml-cuda/template-instances/fattn-wmma*.cu")
list(APPEND GGML_SOURCES_ROCM ${SRCS})
file(GLOB SRCS "ggml-cuda/template-instances/mmq*.cu")
list(APPEND GGML_SOURCES_ROCM ${SRCS})
add_compile_definitions(GGML_USE_HIPBLAS GGML_USE_CUDA)
@@ -590,6 +611,19 @@ if (LLAMA_HIPBLAS)
add_compile_definitions(GGML_CUDA_NO_PEER_COPY)
endif()
if (LLAMA_CUDA_FA_ALL_QUANTS)
file(GLOB SRCS "ggml-cuda/template-instances/fattn-vec*.cu")
list(APPEND GGML_SOURCES_ROCM ${SRCS})
add_compile_definitions(GGML_CUDA_FA_ALL_QUANTS)
else()
file(GLOB SRCS "ggml-cuda/template-instances/fattn-vec*q4_0-q4_0.cu")
list(APPEND GGML_SOURCES_ROCM ${SRCS})
file(GLOB SRCS "ggml-cuda/template-instances/fattn-vec*q8_0-q8_0.cu")
list(APPEND GGML_SOURCES_ROCM ${SRCS})
file(GLOB SRCS "ggml-cuda/template-instances/fattn-vec*f16-f16.cu")
list(APPEND GGML_SOURCES_ROCM ${SRCS})
endif()
add_compile_definitions(GGML_CUDA_DMMV_X=${LLAMA_CUDA_DMMV_X})
add_compile_definitions(GGML_CUDA_MMV_Y=${LLAMA_CUDA_MMV_Y})
add_compile_definitions(K_QUANTS_PER_ITERATION=${LLAMA_CUDA_KQUANTS_ITER})
@@ -747,6 +781,7 @@ if (LLAMA_KOMPUTE)
kompute-shaders/op_mul_mat_q4_0.comp
kompute-shaders/op_mul_mat_q4_1.comp
kompute-shaders/op_mul_mat_q6_k.comp
kompute-shaders/op_getrows_f32.comp
kompute-shaders/op_getrows_f16.comp
kompute-shaders/op_getrows_q4_0.comp
kompute-shaders/op_getrows_q4_1.comp
@@ -779,6 +814,7 @@ if (LLAMA_KOMPUTE)
shaderop_mul_mat_q4_0.h
shaderop_mul_mat_q4_1.h
shaderop_mul_mat_q6_k.h
shaderop_getrows_f32.h
shaderop_getrows_f16.h
shaderop_getrows_q4_0.h
shaderop_getrows_q4_1.h
@@ -1216,7 +1252,6 @@ add_library(ggml OBJECT
ggml-quants.c
ggml-quants.h
${GGML_SOURCES_CUDA} ${GGML_HEADERS_CUDA}
${GGML_SOURCES_OPENCL} ${GGML_HEADERS_OPENCL}
${GGML_SOURCES_METAL} ${GGML_HEADERS_METAL}
${GGML_SOURCES_RPC} ${GGML_HEADERS_RPC}
${GGML_SOURCES_EXTRA} ${GGML_HEADERS_EXTRA}
@@ -1304,8 +1339,9 @@ install(FILES ${CMAKE_CURRENT_BINARY_DIR}/LlamaConfig.cmake
DESTINATION ${CMAKE_INSTALL_LIBDIR}/cmake/Llama)
set(GGML_PUBLIC_HEADERS "ggml.h" "ggml-alloc.h" "ggml-backend.h"
"${GGML_HEADERS_CUDA}" "${GGML_HEADERS_OPENCL}"
"${GGML_HEADERS_METAL}" "${GGML_HEADERS_EXTRA}")
"${GGML_HEADERS_CUDA}"
"${GGML_HEADERS_METAL}"
"${GGML_HEADERS_EXTRA}")
set_target_properties(ggml PROPERTIES PUBLIC_HEADER "${GGML_PUBLIC_HEADERS}")
install(TARGETS ggml PUBLIC_HEADER)
@@ -1341,6 +1377,13 @@ if (LLAMA_METAL)
endif()
endif()
configure_file(cmake/llama.pc.in
"${CMAKE_CURRENT_BINARY_DIR}/llama.pc"
@ONLY)
install(FILES "${CMAKE_CURRENT_BINARY_DIR}/llama.pc"
DESTINATION lib/pkgconfig)
#
# programs, examples and tests
#

14
CONTRIBUTING.md Normal file
View File

@@ -0,0 +1,14 @@
# Contributing Guidelines
## Checklist
* Make sure your PR follows the [coding guidelines](https://github.com/ggerganov/llama.cpp/blob/master/README.md#coding-guidelines)
* Test your changes using the commands in the [`tests`](tests) folder. For instance, running the `./tests/test-backend-ops` command tests different backend implementations of the GGML library
* Execute [the full CI locally on your machine](ci/README.md) before publishing
## PR formatting
* Please rate the complexity of your PR (i.e. `Review Complexity : Low`, `Review Complexity : Medium`, `Review Complexity : High`). This makes it easier for maintainers to triage the PRs.
- The PR template has a series of review complexity checkboxes `[ ]` that you can mark as `[X]` for your conveience. Refer to [About task lists](https://docs.github.com/en/get-started/writing-on-github/working-with-advanced-formatting/about-task-lists) for more information.
* If the pull request only contains documentation changes (e.g., updating READMEs, adding new wiki pages), please add `[no ci]` to the commit title. This will skip unnecessary CI checks and help reduce build times.
* When squashing multiple commits on merge, use the following format for your commit title: `<module> : <commit title> (#<issue_number>)`. For example: `utils : Fix typo in utils.py (#1234)`

View File

@@ -1,7 +1,7 @@
# Define the default target now so that it is always the first target
BUILD_TARGETS = \
main quantize quantize-stats perplexity imatrix embedding vdot q8dot train-text-from-scratch convert-llama2c-to-ggml \
simple batched batched-bench save-load-state server gguf gguf-split eval-callback llama-bench libllava.a llava-cli baby-llama beam-search \
simple batched batched-bench save-load-state server gguf gguf-split eval-callback llama-bench libllava.a llava-cli baby-llama \
retrieval speculative infill tokenize benchmark-matmult parallel finetune export-lora lookahead lookup passkey gritlm tests/test-c.o
# Binaries only useful for tests
@@ -57,6 +57,8 @@ ifeq ($(UNAME_S),Darwin)
LLAMA_METAL := 1
endif
LLAMA_NO_OPENMP := 1
ifneq ($(UNAME_P),arm)
SYSCTL_M := $(shell sysctl -n hw.optional.arm64 2>/dev/null)
ifeq ($(SYSCTL_M),1)
@@ -67,6 +69,10 @@ ifeq ($(UNAME_S),Darwin)
endif
endif
ifdef LLAMA_RPC
BUILD_TARGETS += rpc-server
endif
default: $(BUILD_TARGETS)
test: $(TEST_TARGETS)
@@ -135,12 +141,16 @@ MK_NVCCFLAGS = -std=c++11
ifdef LLAMA_FAST
MK_CFLAGS += -Ofast
HOST_CXXFLAGS += -Ofast
ifndef LLAMA_DEBUG
MK_NVCCFLAGS += -O3
endif # LLAMA_DEBUG
else
MK_CFLAGS += -O3
MK_CXXFLAGS += -O3
ifndef LLAMA_DEBUG
MK_NVCCFLAGS += -O3
endif
endif # LLAMA_DEBUG
endif # LLAMA_FAST
ifndef LLAMA_NO_CCACHE
CCACHE := $(shell which ccache)
@@ -201,9 +211,10 @@ ifdef LLAMA_SCHED_MAX_COPIES
endif
ifdef LLAMA_DEBUG
MK_CFLAGS += -O0 -g
MK_CXXFLAGS += -O0 -g
MK_LDFLAGS += -g
MK_CFLAGS += -O0 -g
MK_CXXFLAGS += -O0 -g
MK_LDFLAGS += -g
MK_NVCCFLAGS += -O0 -g
ifeq ($(UNAME_S),Linux)
MK_CPPFLAGS += -D_GLIBCXX_ASSERTIONS
@@ -400,6 +411,12 @@ ifndef LLAMA_NO_ACCELERATE
endif
endif # LLAMA_NO_ACCELERATE
ifndef LLAMA_NO_OPENMP
MK_CPPFLAGS += -DGGML_USE_OPENMP
MK_CFLAGS += -fopenmp
MK_CXXFLAGS += -fopenmp
endif # LLAMA_NO_OPENMP
ifdef LLAMA_OPENBLAS
MK_CPPFLAGS += -DGGML_USE_OPENBLAS $(shell pkg-config --cflags-only-I openblas)
MK_CFLAGS += $(shell pkg-config --cflags-only-other openblas)
@@ -416,11 +433,26 @@ ifdef LLAMA_BLIS
MK_LDFLAGS += -lblis -L/usr/local/lib
endif # LLAMA_BLIS
ifdef LLAMA_RPC
MK_CPPFLAGS += -DGGML_USE_RPC
OBJS += ggml-rpc.o
endif # LLAMA_RPC
ifdef LLAMA_CUBLAS
# LLAMA_CUBLAS is deprecated and will be removed in the future
LLAMA_CUDA := 1
endif
OBJS_CUDA_TEMP_INST = $(patsubst %.cu,%.o,$(wildcard ggml-cuda/template-instances/fattn-wmma*.cu))
OBJS_CUDA_TEMP_INST += $(patsubst %.cu,%.o,$(wildcard ggml-cuda/template-instances/mmq*.cu))
ifdef LLAMA_CUDA_FA_ALL_QUANTS
OBJS_CUDA_TEMP_INST += $(patsubst %.cu,%.o,$(wildcard ggml-cuda/template-instances/fattn-vec*.cu))
else
OBJS_CUDA_TEMP_INST += $(patsubst %.cu,%.o,$(wildcard ggml-cuda/template-instances/fattn-vec*q4_0-q4_0.cu))
OBJS_CUDA_TEMP_INST += $(patsubst %.cu,%.o,$(wildcard ggml-cuda/template-instances/fattn-vec*q8_0-q8_0.cu))
OBJS_CUDA_TEMP_INST += $(patsubst %.cu,%.o,$(wildcard ggml-cuda/template-instances/fattn-vec*f16-f16.cu))
endif # LLAMA_CUDA_FA_ALL_QUANTS
ifdef LLAMA_CUDA
ifneq ('', '$(wildcard /opt/cuda)')
CUDA_PATH ?= /opt/cuda
@@ -431,6 +463,7 @@ ifdef LLAMA_CUDA
MK_LDFLAGS += -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L$(CUDA_PATH)/lib64 -L/usr/lib64 -L$(CUDA_PATH)/targets/$(UNAME_M)-linux/lib -L/usr/lib/wsl/lib
OBJS += ggml-cuda.o
OBJS += $(patsubst %.cu,%.o,$(wildcard ggml-cuda/*.cu))
OBJS += $(OBJS_CUDA_TEMP_INST)
MK_NVCCFLAGS += -use_fast_math
ifdef LLAMA_FATAL_WARNINGS
MK_NVCCFLAGS += -Werror all-warnings
@@ -493,7 +526,10 @@ ifdef LLAMA_CUDA_NO_PEER_COPY
endif # LLAMA_CUDA_NO_PEER_COPY
ifdef LLAMA_CUDA_CCBIN
MK_NVCCFLAGS += -ccbin $(LLAMA_CUDA_CCBIN)
endif
endif # LLAMA_CUDA_CCBIN
ifdef LLAMA_CUDA_FA_ALL_QUANTS
MK_NVCCFLAGS += -DGGML_CUDA_FA_ALL_QUANTS
endif # LLAMA_CUDA_FA_ALL_QUANTS
ifdef JETSON_EOL_MODULE_DETECT
define NVCC_COMPILE
@@ -505,30 +541,13 @@ define NVCC_COMPILE
endef # NVCC_COMPILE
endif # JETSON_EOL_MODULE_DETECT
ggml-cuda/%.o: ggml-cuda/%.cu ggml-cuda/%.cuh ggml.h ggml-common.h ggml-cuda/common.cuh
ggml-cuda/%.o: ggml-cuda/%.cu ggml.h ggml-common.h ggml-cuda/common.cuh
$(NVCC_COMPILE)
ggml-cuda.o: ggml-cuda.cu ggml-cuda.h ggml.h ggml-backend.h ggml-backend-impl.h ggml-common.h $(wildcard ggml-cuda/*.cuh)
$(NVCC_COMPILE)
endif # LLAMA_CUDA
ifdef LLAMA_CLBLAST
MK_CPPFLAGS += -DGGML_USE_CLBLAST $(shell pkg-config --cflags-only-I clblast OpenCL)
MK_CFLAGS += $(shell pkg-config --cflags-only-other clblast OpenCL)
MK_CXXFLAGS += $(shell pkg-config --cflags-only-other clblast OpenCL)
# Mac provides OpenCL as a framework
ifeq ($(UNAME_S),Darwin)
MK_LDFLAGS += -lclblast -framework OpenCL
else
MK_LDFLAGS += $(shell pkg-config --libs clblast OpenCL)
endif
OBJS += ggml-opencl.o
ggml-opencl.o: ggml-opencl.cpp ggml-opencl.h
$(CXX) $(CXXFLAGS) -c $< -o $@
endif # LLAMA_CLBLAST
ifdef LLAMA_VULKAN
MK_CPPFLAGS += -DGGML_USE_VULKAN
MK_LDFLAGS += -lvulkan
@@ -571,6 +590,7 @@ ifdef LLAMA_HIP_UMA
MK_CPPFLAGS += -DGGML_HIP_UMA
endif # LLAMA_HIP_UMA
MK_LDFLAGS += -L$(ROCM_PATH)/lib -Wl,-rpath=$(ROCM_PATH)/lib
MK_LDFLAGS += -L$(ROCM_PATH)/lib64 -Wl,-rpath=$(ROCM_PATH)/lib64
MK_LDFLAGS += -lhipblas -lamdhip64 -lrocblas
HIPFLAGS += $(addprefix --offload-arch=,$(AMDGPU_TARGETS))
HIPFLAGS += -DGGML_CUDA_DMMV_X=$(LLAMA_CUDA_DMMV_X)
@@ -584,11 +604,12 @@ ifdef LLAMA_CUDA_NO_PEER_COPY
endif # LLAMA_CUDA_NO_PEER_COPY
OBJS += ggml-cuda.o
OBJS += $(patsubst %.cu,%.o,$(wildcard ggml-cuda/*.cu))
OBJS += $(OBJS_CUDA_TEMP_INST)
ggml-cuda.o: ggml-cuda.cu ggml-cuda.h ggml.h ggml-backend.h ggml-backend-impl.h ggml-common.h $(wildcard ggml-cuda/*.cuh)
$(HIPCC) $(CXXFLAGS) $(HIPFLAGS) -x hip -c -o $@ $<
ggml-cuda/%.o: ggml-cuda/%.cu ggml-cuda/%.cuh ggml.h ggml-common.h ggml-cuda/common.cuh
ggml-cuda/%.o: ggml-cuda/%.cu ggml.h ggml-common.h ggml-cuda/common.cuh
$(HIPCC) $(CXXFLAGS) $(HIPFLAGS) -x hip -c -o $@ $<
endif # LLAMA_HIPBLAS
@@ -626,11 +647,26 @@ ggml-metal-embed.o: ggml-metal.metal ggml-common.h
endif
endif # LLAMA_METAL
OBJS += ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o
COMMON_H_DEPS = common/common.h common/sampling.h common/log.h llama.h
COMMON_DEPS = common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o
ifndef LLAMA_NO_LLAMAFILE
sgemm.o: sgemm.cpp sgemm.h ggml.h
$(CXX) $(CXXFLAGS) -c $< -o $@
endif
ifdef LLAMA_RPC
ggml-rpc.o: ggml-rpc.cpp ggml-rpc.h
$(CXX) $(CXXFLAGS) -c $< -o $@
rpc-server.o: examples/rpc/rpc-server.cpp ggml-rpc.h
$(CXX) $(CXXFLAGS) -c $< -o $@
rpc-server: rpc-server.o ggml.o llama.o $(COMMON_DEPS) $(OBJS)
$(CXX) $(CXXFLAGS) $^ -o $@ $(LDFLAGS)
endif # LLAMA_RPC
GF_CC := $(CC)
include scripts/get-flags.mk
@@ -710,14 +746,9 @@ unicode.o: unicode.cpp unicode.h
unicode-data.o: unicode-data.cpp unicode-data.h
$(CXX) $(CXXFLAGS) -c $< -o $@
OBJS += ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o
llama.o: llama.cpp unicode.h ggml.h ggml-alloc.h ggml-backend.h ggml-cuda.h ggml-metal.h llama.h
$(CXX) $(CXXFLAGS) -c $< -o $@
COMMON_H_DEPS = common/common.h common/sampling.h common/log.h llama.h
COMMON_DEPS = common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o
common.o: common/common.cpp $(COMMON_H_DEPS)
$(CXX) $(CXXFLAGS) -c $< -o $@
@@ -748,6 +779,7 @@ libllama.a: llama.o ggml.o $(OBJS) $(COMMON_DEPS)
clean:
rm -vrf *.o tests/*.o *.so *.a *.dll benchmark-matmult lookup-create lookup-merge lookup-stats common/build-info.cpp *.dot $(COV_TARGETS) $(BUILD_TARGETS) $(TEST_TARGETS)
rm -vrf ggml-cuda/*.o
rm -vrf ggml-cuda/template-instances/*.o
find examples pocs -type f -name "*.o" -delete
#
@@ -816,7 +848,7 @@ save-load-state: examples/save-load-state/save-load-state.cpp ggml.o llama.o $(C
$(CXX) $(CXXFLAGS) -c $< -o $(call GET_OBJ_FILE, $<)
$(CXX) $(CXXFLAGS) $(filter-out %.h $<,$^) $(call GET_OBJ_FILE, $<) -o $@ $(LDFLAGS)
server: examples/server/server.cpp examples/server/utils.hpp examples/server/httplib.h common/json.hpp examples/server/index.html.hpp examples/server/index.js.hpp examples/server/completion.js.hpp examples/server/json-schema-to-grammar.mjs.hpp common/stb_image.h ggml.o llama.o $(COMMON_DEPS) grammar-parser.o $(OBJS)
server: examples/server/server.cpp examples/server/utils.hpp examples/server/httplib.h common/json.hpp examples/server/colorthemes.css.hpp examples/server/style.css.hpp examples/server/theme-beeninorder.css.hpp examples/server/theme-ketivah.css.hpp examples/server/theme-mangotango.css.hpp examples/server/theme-playground.css.hpp examples/server/theme-polarnight.css.hpp examples/server/theme-snowstorm.css.hpp examples/server/index.html.hpp examples/server/index-new.html.hpp examples/server/index.js.hpp examples/server/completion.js.hpp examples/server/system-prompts.js.hpp examples/server/prompt-formats.js.hpp examples/server/json-schema-to-grammar.mjs.hpp common/stb_image.h ggml.o llama.o $(COMMON_DEPS) grammar-parser.o $(OBJS)
$(CXX) $(CXXFLAGS) -c $< -o $(call GET_OBJ_FILE, $<)
$(CXX) $(CXXFLAGS) $(filter-out %.h %.hpp $<,$^) -Iexamples/server $(call GET_OBJ_FILE, $<) -o $@ $(LDFLAGS) $(LWINSOCK2)
@@ -866,10 +898,6 @@ baby-llama: examples/baby-llama/baby-llama.cpp ggml.o llama.o $(COMMON_DEPS) tra
$(CXX) $(CXXFLAGS) -c $< -o $(call GET_OBJ_FILE, $<)
$(CXX) $(CXXFLAGS) $(filter-out %.h $<,$^) $(call GET_OBJ_FILE, $<) -o $@ $(LDFLAGS)
beam-search: examples/beam-search/beam-search.cpp ggml.o llama.o $(COMMON_DEPS) $(OBJS)
$(CXX) $(CXXFLAGS) -c $< -o $(call GET_OBJ_FILE, $<)
$(CXX) $(CXXFLAGS) $(filter-out %.h $<,$^) $(call GET_OBJ_FILE, $<) -o $@ $(LDFLAGS)
finetune: examples/finetune/finetune.cpp ggml.o llama.o $(COMMON_DEPS) train.o $(OBJS)
$(CXX) $(CXXFLAGS) -c $< -o $(call GET_OBJ_FILE, $<)
$(CXX) $(CXXFLAGS) $(filter-out %.h $<,$^) $(call GET_OBJ_FILE, $<) -o $@ $(LDFLAGS)

View File

@@ -29,7 +29,7 @@ The llama.cpp SYCL backend is designed to support **Intel GPU** firstly. Based o
When targeting **Intel CPU**, it is recommended to use llama.cpp for [Intel oneMKL](README.md#intel-onemkl) backend.
It has the similar design of other llama.cpp BLAS-based paths such as *OpenBLAS, cuBLAS, CLBlast etc..*. In beginning work, the oneAPI's [SYCLomatic](https://github.com/oneapi-src/SYCLomatic) open-source migration tool (Commercial release [Intel® DPC++ Compatibility Tool](https://www.intel.com/content/www/us/en/developer/tools/oneapi/dpc-compatibility-tool.html)) was used for this purpose.
It has the similar design of other llama.cpp BLAS-based paths such as *OpenBLAS, cuBLAS, etc..*. In beginning work, the oneAPI's [SYCLomatic](https://github.com/oneapi-src/SYCLomatic) open-source migration tool (Commercial release [Intel® DPC++ Compatibility Tool](https://www.intel.com/content/www/us/en/developer/tools/oneapi/dpc-compatibility-tool.html)) was used for this purpose.
## News

175
README.md
View File

@@ -2,7 +2,9 @@
![llama](https://user-images.githubusercontent.com/1991296/230134379-7181e485-c521-4d23-a0d6-f7b3b61ba524.png)
[![License: MIT](https://img.shields.io/badge/license-MIT-blue.svg)](https://opensource.org/licenses/MIT) [![Server](https://github.com/ggerganov/llama.cpp/actions/workflows/server.yml/badge.svg?branch=master&event=schedule)](https://github.com/ggerganov/llama.cpp/actions/workflows/server.yml)
[![License: MIT](https://img.shields.io/badge/license-MIT-blue.svg)](https://opensource.org/licenses/MIT)
[![Server](https://github.com/ggerganov/llama.cpp/actions/workflows/server.yml/badge.svg?branch=master&event=schedule)](https://github.com/ggerganov/llama.cpp/actions/workflows/server.yml)
[![Conan Center](https://shields.io/conan/v/llama-cpp)](https://conan.io/center/llama-cpp)
[Roadmap](https://github.com/users/ggerganov/projects/7) / [Project status](https://github.com/ggerganov/llama.cpp/discussions/3471) / [Manifesto](https://github.com/ggerganov/llama.cpp/discussions/205) / [ggml](https://github.com/ggerganov/ggml)
@@ -20,7 +22,8 @@ Inference of Meta's [LLaMA](https://arxiv.org/abs/2302.13971) model (and others)
### Hot topics
- **Initial Flash-Attention support: https://github.com/ggerganov/llama.cpp/pull/5021**
- **`convert.py` has been deprecated and moved to `examples/convert-legacy-llama.py`, please use `convert-hf-to-gguf.py`** https://github.com/ggerganov/llama.cpp/pull/7430
- Initial Flash-Attention support: https://github.com/ggerganov/llama.cpp/pull/5021
- BPE pre-tokenization support has been added: https://github.com/ggerganov/llama.cpp/pull/6920
- MoE memory layout has been updated - reconvert models for `mmap` support and regenerate `imatrix` https://github.com/ggerganov/llama.cpp/pull/6387
- Model sharding instructions using `gguf-split` https://github.com/ggerganov/llama.cpp/discussions/6404
@@ -50,7 +53,6 @@ Inference of Meta's [LLaMA](https://arxiv.org/abs/2302.13971) model (and others)
<li><a href="#quantization">Quantization</a></li>
<li><a href="#interactive-mode">Interactive mode</a></li>
<li><a href="#constrained-output-with-grammars">Constrained output with grammars</a></li>
<li><a href="#instruct-mode">Instruct mode</a></li>
<li><a href="#obtaining-and-using-the-facebook-llama-2-model">Obtaining and using the Facebook LLaMA 2 model</a></li>
<li><a href="#seminal-papers-and-background-on-the-models">Seminal papers and background on the models</a></li>
<li><a href="#perplexity-measuring-model-quality">Perplexity (measuring model quality)</a></li>
@@ -74,7 +76,7 @@ variety of hardware - locally and in the cloud.
- AVX, AVX2 and AVX512 support for x86 architectures
- 1.5-bit, 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit integer quantization for faster inference and reduced memory use
- Custom CUDA kernels for running LLMs on NVIDIA GPUs (support for AMD GPUs via HIP)
- Vulkan, SYCL, and (partial) OpenCL backend support
- Vulkan and SYCL backend support
- CPU+GPU hybrid inference to partially accelerate models larger than the total VRAM capacity
Since its [inception](https://github.com/ggerganov/llama.cpp/issues/33#issuecomment-1465108022), the project has
@@ -147,6 +149,8 @@ Typically finetunes of the base models below are supported as well.
[llama.cpp web server](./examples/server) is a lightweight [OpenAI API](https://github.com/openai/openai-openapi) compatible HTTP server that can be used to serve local models and easily connect them to existing clients.
[simplechat](./examples/server/public_simplechat) is a simple chat client, which can be used to chat with the model exposed using above web server (use --path to point to simplechat), from a local web browser.
**Bindings:**
- Python: [abetlen/llama-cpp-python](https://github.com/abetlen/llama-cpp-python)
@@ -200,6 +204,7 @@ Unless otherwise noted these projects are open-source with permissive licensing:
- [KodiBot](https://github.com/firatkiral/kodibot) (GPL)
- [eva](https://github.com/ylsdamxssjxxdd/eva) (MIT)
- [AI Sublime Text plugin](https://github.com/yaroslavyaroslav/OpenAI-sublime-text) (MIT)
- [AIKit](https://github.com/sozercan/aikit) (MIT)
*(to have a project listed here, it should clearly state that it depends on `llama.cpp`)*
@@ -358,17 +363,6 @@ In order to build llama.cpp you have four different options.
cmake --build build --config Debug
```
- Using `Zig` (version 0.11 or later):
Building for optimization levels and CPU features can be accomplished using standard build arguments, for example AVX2, FMA, F16C,
it's also possible to cross compile for other operating systems and architectures:
```bash
zig build -Doptimize=ReleaseFast -Dtarget=x86_64-windows-gnu -Dcpu=x86_64+avx2+fma+f16c
```
The `zig targets` command will give you valid options to use.
- Using `gmake` (FreeBSD):
1. Install and activate [DRM in FreeBSD](https://wiki.freebsd.org/Graphics)
@@ -376,15 +370,18 @@ In order to build llama.cpp you have four different options.
3. Install compilation dependencies.
```bash
sudo pkg install gmake automake autoconf pkgconf llvm15 clinfo clover \
opencl clblast openblas
sudo pkg install gmake automake autoconf pkgconf llvm15 openblas
gmake CC=/usr/local/bin/clang15 CXX=/usr/local/bin/clang++15 -j4
```
**Notes:** With this packages you can build llama.cpp with OPENBLAS and
CLBLAST support for use OpenCL GPU acceleration in FreeBSD. Please read
the instructions for use and activate this options in this document below.
### Homebrew
On Mac and Linux, the homebrew package manager can be used via
```
brew install llama.cpp
```
The formula is automatically updated with new `llama.cpp` releases. More info: https://github.com/ggerganov/llama.cpp/discussions/7668
### Metal Build
@@ -396,7 +393,7 @@ argument.
### BLAS Build
Building the program with BLAS support may lead to some performance improvements in prompt processing using batch sizes higher than 32 (the default is 512). Support with CPU-only BLAS implementations doesn't affect the normal generation performance. We may see generation performance improvements with GPU-involved BLAS implementations, e.g. cuBLAS, hipBLAS and CLBlast. There are currently several different BLAS implementations available for build and use:
Building the program with BLAS support may lead to some performance improvements in prompt processing using batch sizes higher than 32 (the default is 512). Support with CPU-only BLAS implementations doesn't affect the normal generation performance. We may see generation performance improvements with GPU-involved BLAS implementations, e.g. cuBLAS, hipBLAS. There are currently several different BLAS implementations available for build and use:
- #### Accelerate Framework:
@@ -489,6 +486,7 @@ Building the program with BLAS support may lead to some performance improvements
| LLAMA_CUDA_F16 | Boolean | false | If enabled, use half-precision floating point arithmetic for the CUDA dequantization + mul mat vec kernels and for the q4_1 and q5_1 matrix matrix multiplication kernels. Can improve performance on relatively recent GPUs. |
| LLAMA_CUDA_KQUANTS_ITER | 1 or 2 | 2 | Number of values processed per iteration and per CUDA thread for Q2_K and Q6_K quantization formats. Setting this value to 1 can improve performance for slow GPUs. |
| LLAMA_CUDA_PEER_MAX_BATCH_SIZE | Positive integer | 128 | Maximum batch size for which to enable peer access between multiple GPUs. Peer access requires either Linux or NVLink. When using NVLink enabling peer access for larger batch sizes is potentially beneficial. |
| LLAMA_CUDA_FA_ALL_QUANTS | Boolean | false | Compile support for all KV cache quantization type (combinations) for the FlashAttention CUDA kernels. More fine-grained control over KV cache size but compilation takes much longer. |
- #### hipBLAS
@@ -549,111 +547,6 @@ Building the program with BLAS support may lead to some performance improvements
| LLAMA_CUDA_MMV_Y | Positive integer | 1 | Block size in y direction for the HIP mul mat vec kernels. Increasing this value can improve performance on fast GPUs. Power of 2 recommended. Does not affect k-quants. |
| LLAMA_CUDA_KQUANTS_ITER | 1 or 2 | 2 | Number of values processed per iteration and per HIP thread for Q2_K and Q6_K quantization formats. Setting this value to 1 can improve performance for slow GPUs. |
- #### CLBlast
OpenCL acceleration is provided by the matrix multiplication kernels from the [CLBlast](https://github.com/CNugteren/CLBlast) project and custom kernels for ggml that can generate tokens on the GPU.
You will need the [OpenCL SDK](https://github.com/KhronosGroup/OpenCL-SDK).
- For Ubuntu, Debian, and Fedora the packages `opencl-headers`, `ocl-icd` may be needed.
- For Windows, a pre-built SDK is available on the [OpenCL Releases](https://github.com/KhronosGroup/OpenCL-SDK/releases) page.
- <details>
<summary>Installing the OpenCL SDK from source</summary>
```sh
git clone --recurse-submodules https://github.com/KhronosGroup/OpenCL-SDK.git
cd OpenCL-SDK
cmake -B build -DBUILD_DOCS=OFF \
-DBUILD_EXAMPLES=OFF \
-DBUILD_TESTING=OFF \
-DOPENCL_SDK_BUILD_SAMPLES=OFF \
-DOPENCL_SDK_TEST_SAMPLES=OFF
cmake --build build
cmake --install build --prefix /some/path
```
</details>
##### Installing CLBlast
Pre-built CLBlast binaries may be found on the [CLBlast Releases](https://github.com/CNugteren/CLBlast/releases) page. For Unix variants, it may also be found in your operating system's packages.
Linux packaging:
Fedora Linux:
```bash
sudo dnf install clblast
```
Alternatively, they may be built from source.
- <details>
<summary>Windows:</summary>
```cmd
set OPENCL_SDK_ROOT="C:/OpenCL-SDK-v2023.04.17-Win-x64"
git clone https://github.com/CNugteren/CLBlast.git
cd CLBlast
cmake -B build -DBUILD_SHARED_LIBS=OFF -DOVERRIDE_MSVC_FLAGS_TO_MT=OFF -DTUNERS=OFF -DOPENCL_ROOT=%OPENCL_SDK_ROOT% -G "Visual Studio 17 2022" -A x64
cmake --build build --config Release
cmake --install build --prefix C:/CLBlast
```
(note: `--config Release` at build time is the default and only relevant for Visual Studio builds - or multi-config Ninja builds)
- <details>
<summary>Unix:</summary>
```sh
git clone https://github.com/CNugteren/CLBlast.git
cd CLBlast
cmake -B build -DBUILD_SHARED_LIBS=OFF -DTUNERS=OFF
cmake --build build --config Release
cmake --install build --prefix /some/path
```
Where `/some/path` is where the built library will be installed (default is `/usr/local`).
</details>
##### Building Llama with CLBlast
- Build with make:
```sh
make LLAMA_CLBLAST=1
```
- CMake (Unix):
```sh
cmake -B build -DLLAMA_CLBLAST=ON -DCLBlast_DIR=/some/path
cmake --build build --config Release
```
- CMake (Windows):
```cmd
set CL_BLAST_CMAKE_PKG="C:/CLBlast/lib/cmake/CLBlast"
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DBUILD_SHARED_LIBS=OFF -DLLAMA_CLBLAST=ON -DCMAKE_PREFIX_PATH=%CL_BLAST_CMAKE_PKG% -G "Visual Studio 17 2022" -A x64
cmake --build build --config Release
cmake --install build --prefix C:/LlamaCPP
```
##### Running Llama with CLBlast
The CLBlast build supports `--gpu-layers|-ngl` like the CUDA version does.
To select the correct platform (driver) and device (GPU), you can use the environment variables `GGML_OPENCL_PLATFORM` and `GGML_OPENCL_DEVICE`.
The selection can be a number (starting from 0) or a text string to search:
```sh
GGML_OPENCL_PLATFORM=1 ./main ...
GGML_OPENCL_DEVICE=2 ./main ...
GGML_OPENCL_PLATFORM=Intel ./main ...
GGML_OPENCL_PLATFORM=AMD GGML_OPENCL_DEVICE=1 ./main ...
```
The default behavior is to find the first GPU device, but when it is an integrated GPU on a laptop, for instance, the selectors are useful.
Using the variables it is possible to select a CPU-based driver as well, if so desired.
You can get a list of platforms and devices from the `clinfo -l` command, etc.
- #### Vulkan
**With docker**:
@@ -704,7 +597,7 @@ Building the program with BLAS support may lead to some performance improvements
To obtain the official LLaMA 2 weights please see the <a href="#obtaining-and-using-the-facebook-llama-2-model">Obtaining and using the Facebook LLaMA 2 model</a> section. There is also a large selection of pre-quantized `gguf` models available on Hugging Face.
Note: `convert.py` has been moved to `examples/convert-legacy-llama.py` and shouldn't be used for anything other than `Llama/Llama2/Mistral` models and their derievatives.
Note: `convert.py` has been moved to `examples/convert-legacy-llama.py` and shouldn't be used for anything other than `Llama/Llama2/Mistral` models and their derivatives.
It does not support LLaMA 3, you can use `convert-hf-to-gguf.py` with LLaMA 3 downloaded from Hugging Face.
```bash
@@ -875,34 +768,6 @@ The `grammars/` folder contains a handful of sample grammars. To write your own,
For authoring more complex JSON grammars, you can also check out https://grammar.intrinsiclabs.ai/, a browser app that lets you write TypeScript interfaces which it compiles to GBNF grammars that you can save for local use. Note that the app is built and maintained by members of the community, please file any issues or FRs on [its repo](http://github.com/intrinsiclabsai/gbnfgen) and not this one.
### Instruct mode
1. First, download and place the `ggml` model into the `./models` folder
2. Run the `main` tool like this:
```
./examples/alpaca.sh
```
Sample run:
```
== Running in interactive mode. ==
- Press Ctrl+C to interject at any time.
- Press Return to return control to LLaMA.
- If you want to submit another line, end your input in '\'.
Below is an instruction that describes a task. Write a response that appropriately completes the request.
> How many letters are there in the English alphabet?
There 26 letters in the English Alphabet
> What is the most common way of transportation in Amsterdam?
The majority (54%) are using public transit. This includes buses, trams and metros with over 100 lines throughout the city which make it very accessible for tourists to navigate around town as well as locals who commute by tram or metro on a daily basis
> List 5 words that start with "ca".
cadaver, cauliflower, cabbage (vegetable), catalpa (tree) and Cailleach.
>
```
### Obtaining and using the Facebook LLaMA 2 model
- Refer to [Facebook's LLaMA download page](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) if you want to access the model data.

View File

@@ -9,7 +9,7 @@ set( CMAKE_CXX_COMPILER clang++ )
set( CMAKE_C_COMPILER_TARGET ${target} )
set( CMAKE_CXX_COMPILER_TARGET ${target} )
set( arch_c_flags "-march=armv8.7-a -fvectorize -ffp-model=fast" )
set( arch_c_flags "-march=armv8.7-a -fvectorize -ffp-model=fast -fno-finite-math-only" )
set( warn_c_flags "-Wno-format -Wno-unused-variable -Wno-unused-function -Wno-gnu-zero-variadic-macro-arguments" )
set( CMAKE_C_FLAGS_INIT "${arch_c_flags} ${warn_c_flags}" )

10
cmake/llama.pc.in Normal file
View File

@@ -0,0 +1,10 @@
prefix=@CMAKE_INSTALL_PREFIX@
exec_prefix=${prefix}
libdir=${exec_prefix}/lib
includedir=${prefix}/include
Name: llama
Description: Port of Facebook's LLaMA model in C/C++
Version: @PROJECT_VERSION@
Libs: -L${libdir} -lllama
Cflags: -I${includedir}

View File

@@ -84,4 +84,4 @@ endif ()
target_include_directories(${TARGET} PUBLIC .)
target_compile_features(${TARGET} PUBLIC cxx_std_11)
target_link_libraries(${TARGET} PRIVATE ${LLAMA_COMMON_EXTRA_LIBS} PUBLIC llama)
target_link_libraries(${TARGET} PRIVATE ${LLAMA_COMMON_EXTRA_LIBS} PUBLIC llama Threads::Threads)

File diff suppressed because it is too large Load Diff

View File

@@ -56,66 +56,67 @@ struct gpt_params {
uint32_t seed = LLAMA_DEFAULT_SEED; // RNG seed
int32_t n_threads = cpu_get_num_math();
int32_t n_threads_draft = -1;
int32_t n_threads_batch = -1; // number of threads to use for batch processing (-1 = use n_threads)
int32_t n_threads_batch_draft = -1;
int32_t n_predict = -1; // new tokens to predict
int32_t n_ctx = 512; // context size
int32_t n_batch = 2048; // logical batch size for prompt processing (must be >=32 to use BLAS)
int32_t n_ubatch = 512; // physical batch size for prompt processing (must be >=32 to use BLAS)
int32_t n_keep = 0; // number of tokens to keep from initial prompt
int32_t n_draft = 5; // number of tokens to draft during speculative decoding
int32_t n_chunks = -1; // max number of chunks to process (-1 = unlimited)
int32_t n_parallel = 1; // number of parallel sequences to decode
int32_t n_sequences = 1; // number of sequences to decode
float p_split = 0.1f; // speculative decoding split probability
int32_t n_gpu_layers = -1; // number of layers to store in VRAM (-1 - use default)
int32_t n_gpu_layers_draft = -1; // number of layers to store in VRAM for the draft model (-1 - use default)
llama_split_mode split_mode = LLAMA_SPLIT_MODE_LAYER; // how to split the model across GPUs
int32_t main_gpu = 0; // the GPU that is used for scratch and small tensors
float tensor_split[128] = {0}; // how split tensors should be distributed across GPUs
int32_t n_beams = 0; // if non-zero then use beam search of given width.
int32_t grp_attn_n = 1; // group-attention factor
int32_t grp_attn_w = 512; // group-attention width
int32_t n_print = -1; // print token count every n tokens (-1 = disabled)
float rope_freq_base = 0.0f; // RoPE base frequency
float rope_freq_scale = 0.0f; // RoPE frequency scaling factor
int32_t n_threads_draft = -1;
int32_t n_threads_batch = -1; // number of threads to use for batch processing (-1 = use n_threads)
int32_t n_threads_batch_draft = -1;
int32_t n_predict = -1; // new tokens to predict
int32_t n_ctx = 0; // context size
int32_t n_batch = 2048; // logical batch size for prompt processing (must be >=32 to use BLAS)
int32_t n_ubatch = 512; // physical batch size for prompt processing (must be >=32 to use BLAS)
int32_t n_keep = 0; // number of tokens to keep from initial prompt
int32_t n_draft = 5; // number of tokens to draft during speculative decoding
int32_t n_chunks = -1; // max number of chunks to process (-1 = unlimited)
int32_t n_parallel = 1; // number of parallel sequences to decode
int32_t n_sequences = 1; // number of sequences to decode
float p_split = 0.1f; // speculative decoding split probability
int32_t n_gpu_layers = -1; // number of layers to store in VRAM (-1 - use default)
int32_t n_gpu_layers_draft = -1; // number of layers to store in VRAM for the draft model (-1 - use default)
int32_t main_gpu = 0; // the GPU that is used for scratch and small tensors
float tensor_split[128] = {0}; // how split tensors should be distributed across GPUs
int32_t n_beams = 0; // if non-zero then use beam search of given width.
int32_t grp_attn_n = 1; // group-attention factor
int32_t grp_attn_w = 512; // group-attention width
int32_t n_print = -1; // print token count every n tokens (-1 = disabled)
float rope_freq_base = 0.0f; // RoPE base frequency
float rope_freq_scale = 0.0f; // RoPE frequency scaling factor
float yarn_ext_factor = -1.0f; // YaRN extrapolation mix factor
float yarn_attn_factor = 1.0f; // YaRN magnitude scaling factor
float yarn_attn_factor = 1.0f; // YaRN magnitude scaling factor
float yarn_beta_fast = 32.0f; // YaRN low correction dim
float yarn_beta_slow = 1.0f; // YaRN high correction dim
int32_t yarn_orig_ctx = 0; // YaRN original context length
float yarn_beta_slow = 1.0f; // YaRN high correction dim
int32_t yarn_orig_ctx = 0; // YaRN original context length
float defrag_thold = -1.0f; // KV cache defragmentation threshold
std::string rpc_servers = ""; // comma separated list of RPC servers
ggml_backend_sched_eval_callback cb_eval = nullptr;
void * cb_eval_user_data = nullptr;
ggml_numa_strategy numa = GGML_NUMA_STRATEGY_DISABLED;
enum llama_split_mode split_mode = LLAMA_SPLIT_MODE_LAYER; // how to split the model across GPUs
enum llama_rope_scaling_type rope_scaling_type = LLAMA_ROPE_SCALING_TYPE_UNSPECIFIED;
enum llama_pooling_type pooling_type = LLAMA_POOLING_TYPE_UNSPECIFIED; // pooling type for embeddings
// // sampling parameters
struct llama_sampling_params sparams;
std::string model = ""; // model path
std::string model_draft = ""; // draft model for speculative decoding
std::string model = ""; // model path
std::string model_draft = ""; // draft model for speculative decoding
std::string model_alias = "unknown"; // model alias
std::string model_url = ""; // model url to download
std::string hf_repo = ""; // HF repo
std::string hf_file = ""; // HF file
std::string model_url = ""; // model url to download
std::string hf_repo = ""; // HF repo
std::string hf_file = ""; // HF file
std::string prompt = "";
std::string prompt_file = ""; // store the external prompt file name
std::string path_prompt_cache = ""; // path to file for saving/loading prompt eval state
std::string input_prefix = ""; // string to prefix user inputs with
std::string input_suffix = ""; // string to suffix user inputs with
std::vector<std::string> antiprompt; // string upon seeing which more user input is prompted
std::string logdir = ""; // directory in which to save YAML log files
std::string prompt_file = ""; // store the external prompt file name
std::string path_prompt_cache = ""; // path to file for saving/loading prompt eval state
std::string input_prefix = ""; // string to prefix user inputs with
std::string input_suffix = ""; // string to suffix user inputs with
std::string logdir = ""; // directory in which to save YAML log files
std::string lookup_cache_static = ""; // path of static ngram cache file for lookup decoding
std::string lookup_cache_dynamic = ""; // path of dynamic ngram cache file for lookup decoding
std::string logits_file = ""; // file for saving *all* logits
std::string logits_file = ""; // file for saving *all* logits
std::string rpc_servers = ""; // comma separated list of RPC servers
std::vector<std::string> in_files; // all input files
std::vector<std::string> antiprompt; // strings upon which more user input is prompted (a.k.a. reverse prompts)
std::vector<llama_model_kv_override> kv_overrides;
// TODO: avoid tuple, use struct
@@ -124,37 +125,36 @@ struct gpt_params {
std::vector<llama_control_vector_load_info> control_vectors; // control vector with user defined scale
int32_t verbosity = 0;
int32_t control_vector_layer_start = -1; // layer range for control vector
int32_t control_vector_layer_end = -1; // layer range for control vector
int ppl_stride = 0; // stride for perplexity calculations. If left at 0, the pre-existing approach will be used.
int ppl_output_type = 0; // = 0 -> ppl output is as usual, = 1 -> ppl output is num_tokens, ppl, one per line
// (which is more convenient to use for plotting)
//
bool hellaswag = false; // compute HellaSwag score over random tasks from datafile supplied in prompt
size_t hellaswag_tasks = 400; // number of tasks to use when computing the HellaSwag score
int32_t ppl_stride = 0; // stride for perplexity calculations. If left at 0, the pre-existing approach will be used.
int32_t ppl_output_type = 0; // = 0 -> ppl output is as usual, = 1 -> ppl output is num_tokens, ppl, one per line
// (which is more convenient to use for plotting)
//
bool hellaswag = false; // compute HellaSwag score over random tasks from datafile supplied in prompt
size_t hellaswag_tasks = 400; // number of tasks to use when computing the HellaSwag score
bool winogrande = false; // compute Winogrande score over random tasks from datafile supplied in prompt
size_t winogrande_tasks= 0; // number of tasks to use when computing the Winogrande score. If 0, all tasks will be computed
bool winogrande = false; // compute Winogrande score over random tasks from datafile supplied in prompt
size_t winogrande_tasks = 0; // number of tasks to use when computing the Winogrande score. If 0, all tasks will be computed
bool multiple_choice = false; // compute TruthfulQA score over random tasks from datafile supplied in prompt
size_t multiple_choice_tasks = 0; // number of tasks to use when computing the TruthfulQA score. If 0, all tasks will be computed
bool multiple_choice = false; // compute TruthfulQA score over random tasks from datafile supplied in prompt
size_t multiple_choice_tasks = 0; // number of tasks to use when computing the TruthfulQA score. If 0, all tasks will be computed
bool kl_divergence = false; // compute KL divergence
bool kl_divergence = false; // compute KL divergence
bool random_prompt = false; // do not randomize prompt if none provided
bool usage = false; // print usage
bool use_color = false; // use color to distinguish generations and inputs
bool interactive = false; // interactive mode
bool interactive_specials = false; // whether to allow special tokens from user, during interactive mode
bool special = false; // enable special token output
bool interactive = false; // interactive mode
bool interactive_first = false; // wait for user input immediately
bool conversation = false; // conversation mode (does not print special tokens and suffix/prefix)
bool chatml = false; // chatml mode (used for models trained on chatml syntax)
bool prompt_cache_all = false; // save user input and generations to prompt cache
bool prompt_cache_ro = false; // open the prompt cache read-only and do not update it
bool embedding = false; // get only sentence embedding
bool escape = false; // escape "\n", "\r", "\t", "\'", "\"", and "\\"
bool interactive_first = false; // wait for user input immediately
bool escape = true; // escape "\n", "\r", "\t", "\'", "\"", and "\\"
bool multiline_input = false; // reverse the usage of `\`
bool simple_io = false; // improves compatibility with subprocesses and limited consoles
bool cont_batching = true; // insert new sequences for decoding on-the-fly
@@ -162,7 +162,6 @@ struct gpt_params {
bool input_prefix_bos = false; // prefix BOS to user inputs, preceding input_prefix
bool ignore_eos = false; // ignore generated EOS tokens
bool instruct = false; // instruction mode (used for Alpaca models)
bool logits_all = false; // return logits for all tokens in the batch
bool use_mmap = true; // use mmap for faster loads
bool use_mlock = false; // use mlock to keep model in memory
@@ -180,6 +179,59 @@ struct gpt_params {
// multimodal models (see examples/llava)
std::string mmproj = ""; // path to multimodal projector
std::vector<std::string> image; // path to image file(s)
// server params
int32_t port = 8080; // server listens on this network port
int32_t timeout_read = 600; // http read timeout in seconds
int32_t timeout_write = timeout_read; // http write timeout in seconds
int32_t n_threads_http = -1; // number of threads to process HTTP requests
std::string hostname = "127.0.0.1";
std::string public_path = "";
std::string chat_template = "";
std::string system_prompt = "";
std::vector<std::string> api_keys;
std::string ssl_file_key = "";
std::string ssl_file_cert = "";
bool endpoint_slots = true;
bool endpoint_metrics = false;
bool log_json = false;
std::string slot_save_path;
float slot_prompt_similarity = 0.5f;
// batched-bench params
bool is_pp_shared = false;
std::vector<int32_t> n_pp;
std::vector<int32_t> n_tg;
std::vector<int32_t> n_pl;
// retrieval params
std::vector<std::string> context_files; // context files to embed
int32_t chunk_size = 64; // chunk size for context embedding
std::string chunk_separator = "\n"; // chunk separator for context embedding
// passkey params
int32_t n_junk = 250; // number of times to repeat the junk text
int32_t i_pos = -1; // position of the passkey in the junk text
// imatrix params
std::string out_file = "imatrix.dat"; // save the resulting imatrix to this file
int32_t n_out_freq = 10; // output the imatrix every n_out_freq iterations
int32_t n_save_freq = 0; // save the imatrix every n_save_freq iterations
int32_t i_chunk = 0; // start processing from this chunk
bool process_output = false; // collect data for the output tensor
bool compute_ppl = true; // whether to compute perplexity
};
void gpt_params_handle_model_default(gpt_params & params);
@@ -199,7 +251,20 @@ std::vector<std::string> string_split(std::string input, char separator);
std::string string_strip(const std::string & str);
std::string string_get_sortable_timestamp();
std::string string_random_prompt(std::mt19937 & rng);
template<class T>
static std::vector<T> string_split(const std::string & str, char delim) {
std::vector<T> values;
std::istringstream str_stream(str);
std::string token;
while (std::getline(str_stream, token, delim)) {
T value;
std::istringstream token_stream(token);
token_stream >> value;
values.push_back(value);
}
return values;
}
bool string_parse_kv_override(const char * data, std::vector<llama_model_kv_override> & overrides);
void string_process_escapes(std::string & input);
@@ -212,6 +277,7 @@ bool fs_validate_filename(const std::string & filename);
bool fs_create_directory_with_parents(const std::string & path);
std::string fs_get_cache_directory();
std::string fs_get_cache_file(const std::string & filename);
//
// Model utils
@@ -282,6 +348,13 @@ std::string llama_detokenize_bpe(
// defaults to true when model type is SPM, otherwise false.
bool llama_should_add_bos_token(const llama_model * model);
//
// Chat template utils
//
// Check if the template supplied via "--chat-template" is supported or not. Returns true if it's valid
bool llama_chat_verify_template(const std::string & tmpl);
//
// KV cache utils
//

View File

@@ -46,8 +46,12 @@ namespace grammar_parser {
state.rules[rule_id] = rule;
}
static bool is_digit_char(char c) {
return '0' <= c && c <= '9';
}
static bool is_word_char(char c) {
return ('a' <= c && c <= 'z') || ('A' <= c && c <= 'Z') || c == '-' || ('0' <= c && c <= '9');
return ('a' <= c && c <= 'z') || ('A' <= c && c <= 'Z') || c == '-' || is_digit_char(c);
}
static std::pair<uint32_t, const char *> parse_hex(const char * src, int size) {
@@ -99,6 +103,17 @@ namespace grammar_parser {
return pos;
}
static const char * parse_int(const char * src) {
const char * pos = src;
while (is_digit_char(*pos)) {
pos++;
}
if (pos == src) {
throw std::runtime_error(std::string("expecting integer at ") + src);
}
return pos;
}
static std::pair<uint32_t, const char *> parse_char(const char * src) {
if (*src == '\\') {
switch (src[1]) {
@@ -137,6 +152,60 @@ namespace grammar_parser {
bool is_nested) {
size_t last_sym_start = out_elements.size();
const char * pos = src;
auto handle_repetitions = [&](int min_times, int max_times) {
if (last_sym_start == out_elements.size()) {
throw std::runtime_error(std::string("expecting preceding item to */+/?/{ at ") + pos);
}
// apply transformation to previous symbol (last_sym_start to end) according to
// the following rewrite rules:
// S{m,n} --> S S S (m times) S'(n-m)
// S'(x) ::= S S'(x-1) |
// (... n-m definitions of these S' rules ...)
// S'(1) ::= S |
// S{m,} --> S S S (m times) S'
// S' ::= S S' |
// S* --> S{0,}
// --> S' ::= S S' |
// S+ --> S{1,}
// --> S S'
// S' ::= S S' |
// S? --> S{0,1}
// --> S'
// S' ::= S |
std::vector<llama_grammar_element> previous_elements(out_elements.begin() + last_sym_start, out_elements.end());
if (min_times == 0) {
out_elements.resize(last_sym_start);
} else {
// Repeat the previous elements (min_times - 1) times
for (int i = 1; i < min_times; i++) {
out_elements.insert(out_elements.end(), previous_elements.begin(), previous_elements.end());
}
}
uint32_t last_rec_rule_id = 0;
auto n_opt = max_times < 0 ? 1 : max_times - min_times;
std::vector<llama_grammar_element> rec_rule(previous_elements);
for (int i = 0; i < n_opt; i++) {
rec_rule.resize(previous_elements.size());
uint32_t rec_rule_id = generate_symbol_id(state, rule_name);
if (i > 0 || max_times < 0) {
rec_rule.push_back({LLAMA_GRETYPE_RULE_REF, max_times < 0 ? rec_rule_id : last_rec_rule_id});
}
rec_rule.push_back({LLAMA_GRETYPE_ALT, 0});
rec_rule.push_back({LLAMA_GRETYPE_END, 0});
add_rule(state, rec_rule_id, rec_rule);
last_rec_rule_id = rec_rule_id;
}
if (n_opt > 0) {
out_elements.push_back({LLAMA_GRETYPE_RULE_REF, last_rec_rule_id});
}
};
while (*pos) {
if (*pos == '"') { // literal string
pos++;
@@ -197,40 +266,51 @@ namespace grammar_parser {
throw std::runtime_error(std::string("expecting ')' at ") + pos);
}
pos = parse_space(pos + 1, is_nested);
} else if (*pos == '*' || *pos == '+' || *pos == '?') { // repetition operator
if (last_sym_start == out_elements.size()) {
throw std::runtime_error(std::string("expecting preceding item to */+/? at ") + pos);
}
// apply transformation to previous symbol (last_sym_start to end) according to
// rewrite rules:
// S* --> S' ::= S S' |
// S+ --> S' ::= S S' | S
// S? --> S' ::= S |
uint32_t sub_rule_id = generate_symbol_id(state, rule_name);
std::vector<llama_grammar_element> sub_rule;
// add preceding symbol to generated rule
sub_rule.insert(
sub_rule.end(), out_elements.begin() + last_sym_start, out_elements.end());
if (*pos == '*' || *pos == '+') {
// cause generated rule to recurse
sub_rule.push_back({LLAMA_GRETYPE_RULE_REF, sub_rule_id});
}
// mark start of alternate def
sub_rule.push_back({LLAMA_GRETYPE_ALT, 0});
if (*pos == '+') {
// add preceding symbol as alternate only for '+' (otherwise empty)
sub_rule.insert(
sub_rule.end(), out_elements.begin() + last_sym_start, out_elements.end());
}
sub_rule.push_back({LLAMA_GRETYPE_END, 0});
add_rule(state, sub_rule_id, sub_rule);
// in original rule, replace previous symbol with reference to generated rule
out_elements.resize(last_sym_start);
out_elements.push_back({LLAMA_GRETYPE_RULE_REF, sub_rule_id});
} else if (*pos == '.') { // any char
last_sym_start = out_elements.size();
out_elements.push_back({LLAMA_GRETYPE_CHAR_ANY, 0});
pos = parse_space(pos + 1, is_nested);
} else if (*pos == '*') {
pos = parse_space(pos + 1, is_nested);
handle_repetitions(0, -1);
} else if (*pos == '+') {
pos = parse_space(pos + 1, is_nested);
handle_repetitions(1, -1);
} else if (*pos == '?') {
pos = parse_space(pos + 1, is_nested);
handle_repetitions(0, 1);
} else if (*pos == '{') {
pos = parse_space(pos + 1, is_nested);
if (!is_digit_char(*pos)) {
throw std::runtime_error(std::string("expecting an int at ") + pos);
}
const char * int_end = parse_int(pos);
int min_times = std::stoul(std::string(pos, int_end - pos));
pos = parse_space(int_end, is_nested);
int max_times = -1;
if (*pos == '}') {
max_times = min_times;
pos = parse_space(pos + 1, is_nested);
} else if (*pos == ',') {
pos = parse_space(pos + 1, is_nested);
if (is_digit_char(*pos)) {
const char * int_end = parse_int(pos);
max_times = std::stoul(std::string(pos, int_end - pos));
pos = parse_space(int_end, is_nested);
}
if (*pos != '}') {
throw std::runtime_error(std::string("expecting '}' at ") + pos);
}
pos = parse_space(pos + 1, is_nested);
} else {
throw std::runtime_error(std::string("expecting ',' at ") + pos);
}
handle_repetitions(min_times, max_times);
} else {
break;
}
@@ -325,6 +405,7 @@ namespace grammar_parser {
case LLAMA_GRETYPE_CHAR_NOT: return true;
case LLAMA_GRETYPE_CHAR_ALT: return true;
case LLAMA_GRETYPE_CHAR_RNG_UPPER: return true;
case LLAMA_GRETYPE_CHAR_ANY: return true;
default: return false;
}
}
@@ -339,6 +420,7 @@ namespace grammar_parser {
case LLAMA_GRETYPE_CHAR_NOT: fprintf(file, "CHAR_NOT"); break;
case LLAMA_GRETYPE_CHAR_RNG_UPPER: fprintf(file, "CHAR_RNG_UPPER"); break;
case LLAMA_GRETYPE_CHAR_ALT: fprintf(file, "CHAR_ALT"); break;
case LLAMA_GRETYPE_CHAR_ANY: fprintf(file, "CHAR_ANY"); break;
}
switch (elem.type) {
case LLAMA_GRETYPE_END:
@@ -350,6 +432,7 @@ namespace grammar_parser {
case LLAMA_GRETYPE_CHAR_NOT:
case LLAMA_GRETYPE_CHAR_RNG_UPPER:
case LLAMA_GRETYPE_CHAR_ALT:
case LLAMA_GRETYPE_CHAR_ANY:
fprintf(file, "(\"");
print_grammar_char(file, elem.value);
fprintf(file, "\") ");
@@ -407,11 +490,15 @@ namespace grammar_parser {
}
print_grammar_char(file, elem.value);
break;
case LLAMA_GRETYPE_CHAR_ANY:
fprintf(file, ".");
break;
}
if (is_char_element(elem)) {
switch (rule[i + 1].type) {
case LLAMA_GRETYPE_CHAR_ALT:
case LLAMA_GRETYPE_CHAR_RNG_UPPER:
case LLAMA_GRETYPE_CHAR_ANY:
break;
default:
fprintf(file, "] ");

View File

@@ -16,58 +16,27 @@ static std::string join(Iterator begin, Iterator end, const std::string & separa
static std::string repeat(const std::string & str, size_t n);
static std::string build_repetition(const std::string & item_rule, int min_items, int max_items, const std::string & separator_rule = "", bool item_rule_is_literal = false) {
static std::string build_repetition(const std::string & item_rule, int min_items, int max_items, const std::string & separator_rule = "") {
auto has_max = max_items != std::numeric_limits<int>::max();
if (min_items == 0 && max_items == 1) {
return item_rule + "?";
}
if (separator_rule.empty()) {
if (min_items == 0 && max_items == 1) {
return item_rule + "?";
} else if (min_items == 1 && max_items == std::numeric_limits<int>::max()) {
if (min_items == 1 && !has_max) {
return item_rule + "+";
}
}
std::string result;
if (min_items > 0) {
if (item_rule_is_literal && separator_rule.empty()) {
result = "\"" + repeat(std::string(item_rule.begin() + 1, item_rule.end() - 1), min_items) + "\"";
} else if (min_items == 0 && !has_max) {
return item_rule + "*";
} else {
std::vector<std::string> items(min_items, item_rule);
result = join(items.begin(), items.end(), separator_rule.empty() ? " " : " " + separator_rule + " ");
return item_rule + "{" + std::to_string(min_items) + "," + (has_max ? std::to_string(max_items) : "") + "}";
}
}
std::function<std::string(int, bool)> opt_repetitions = [&](int up_to_n, bool prefix_with_sep) -> std::string {
auto content = prefix_with_sep && !separator_rule.empty() ? separator_rule + " " + item_rule : item_rule;
if (up_to_n == 0) {
return "";
} else if (up_to_n == 1) {
return "(" + content + ")?";
} else if (!separator_rule.empty() && !prefix_with_sep) {
return "(" + content + " " + opt_repetitions(up_to_n - 1, true) + ")?";
} else {
std::string res = repeat("(" + content + " ", up_to_n);
// strip trailing space
res = res.substr(0, res.length() - 1);
res += repeat(")?", up_to_n);
return res;
}
};
if (min_items > 0 && max_items != min_items) {
result += " ";
auto result = item_rule + " " + build_repetition("(" + separator_rule + " " + item_rule + ")", min_items == 0 ? 0 : min_items - 1, has_max ? max_items - 1 : max_items);
if (min_items == 0) {
result = "(" + result + ")?";
}
if (max_items != std::numeric_limits<int>::max()) {
result += opt_repetitions(max_items - min_items, min_items > 0);
} else {
std::string item_operator = "(" + (separator_rule.empty() ? "" : separator_rule + " ") + item_rule + ")";
if (min_items == 0 && !separator_rule.empty()) {
result = "(" + item_rule + " " + item_operator + "*)?";
} else {
result += item_operator + "*";
}
}
return result;
}
@@ -78,30 +47,24 @@ struct BuiltinRule {
std::vector<std::string> deps;
};
const std::string _up_to_15_digits = build_repetition("[0-9]", 0, 15);
std::unordered_map<std::string, BuiltinRule> PRIMITIVE_RULES = {
{"boolean", {"(\"true\" | \"false\") space", {}}},
{"decimal-part", {"[0-9] " + _up_to_15_digits, {}}},
{"integral-part", {"[0-9] | [1-9] " + _up_to_15_digits, {}}},
{"decimal-part", {"[0-9]{1,16}", {}}},
{"integral-part", {"[0] | [1-9] [0-9]{0,15}", {}}},
{"number", {"(\"-\"? integral-part) (\".\" decimal-part)? ([eE] [-+]? integral-part)? space", {"integral-part", "decimal-part"}}},
{"integer", {"(\"-\"? integral-part) space", {"integral-part"}}},
{"value", {"object | array | string | number | boolean | null", {"object", "array", "string", "number", "boolean", "null"}}},
{"object", {"\"{\" space ( string \":\" space value (\",\" space string \":\" space value)* )? \"}\" space", {"string", "value"}}},
{"array", {"\"[\" space ( value (\",\" space value)* )? \"]\" space", {"value"}}},
{"uuid", {"\"\\\"\" [0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F] "
"\"-\" [0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F] "
"\"-\" [0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F] "
"\"-\" [0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F] "
"\"-\" [0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F] \"\\\"\" space", {}}},
{"char", {"[^\"\\\\] | \"\\\\\" ([\"\\\\/bfnrt] | \"u\" [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F])", {}}},
{"uuid", {"\"\\\"\" [0-9a-fA-F]{8} \"-\" [0-9a-fA-F]{4} \"-\" [0-9a-fA-F]{4} \"-\" [0-9a-fA-F]{4} \"-\" [0-9a-fA-F]{12} \"\\\"\" space", {}}},
{"char", {"[^\"\\\\] | \"\\\\\" ([\"\\\\/bfnrt] | \"u\" [0-9a-fA-F]{4})", {}}},
{"string", {"\"\\\"\" char* \"\\\"\" space", {"char"}}},
{"null", {"\"null\" space", {}}},
};
std::unordered_map<std::string, BuiltinRule> STRING_FORMAT_RULES = {
{"date", {"[0-9] [0-9] [0-9] [0-9] \"-\" ( \"0\" [1-9] | \"1\" [0-2] ) \"-\" ( \"0\" [1-9] | [1-2] [0-9] | \"3\" [0-1] )", {}}},
{"time", {"([01] [0-9] | \"2\" [0-3]) \":\" [0-5] [0-9] \":\" [0-5] [0-9] ( \".\" [0-9] [0-9] [0-9] )? ( \"Z\" | ( \"+\" | \"-\" ) ( [01] [0-9] | \"2\" [0-3] ) \":\" [0-5] [0-9] )", {}}},
{"date", {"[0-9]{4} \"-\" ( \"0\" [1-9] | \"1\" [0-2] ) \"-\" ( \"0\" [1-9] | [1-2] [0-9] | \"3\" [0-1] )", {}}},
{"time", {"([01] [0-9] | \"2\" [0-3]) \":\" [0-5] [0-9] \":\" [0-5] [0-9] ( \".\" [0-9]{3} )? ( \"Z\" | ( \"+\" | \"-\" ) ( [01] [0-9] | \"2\" [0-3] ) \":\" [0-5] [0-9] )", {}}},
{"date-time", {"date \"T\" time", {"date", "time"}}},
{"date-string", {"\"\\\"\" date \"\\\"\" space", {"date"}}},
{"time-string", {"\"\\\"\" time \"\\\"\" space", {"time"}}},
@@ -385,8 +348,7 @@ private:
sub_is_literal ? "\"" + sub + "\"" : sub,
min_times,
max_times,
"",
sub_is_literal
""
);
seq.back().second = false;
} else {

View File

@@ -1,4 +1,5 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# This script downloads the tokenizer models of the specified models from Huggingface and
# generates the get_vocab_base_pre() function for convert-hf-to-gguf.py
@@ -82,6 +83,7 @@ models = [
{"name": "jina-v2-es", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/jinaai/jina-embeddings-v2-base-es", },
{"name": "jina-v2-de", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/jinaai/jina-embeddings-v2-base-de", },
{"name": "smaug-bpe", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/abacusai/Smaug-Llama-3-70B-Instruct", },
{"name": "jina-v2-code", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/jinaai/jina-embeddings-v2-base-code", },
]

View File

@@ -1,4 +1,5 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from __future__ import annotations
@@ -46,11 +47,12 @@ class Model:
_model_classes: dict[str, type[Model]] = {}
dir_model: Path
ftype: int
ftype: gguf.LlamaFileType
is_big_endian: bool
endianess: gguf.GGUFEndian
use_temp_file: bool
lazy: bool
model_name: str | None
part_names: list[str]
is_safetensors: bool
hparams: dict[str, Any]
@@ -63,7 +65,7 @@ class Model:
# subclasses should define this!
model_arch: gguf.MODEL_ARCH
def __init__(self, dir_model: Path, ftype: gguf.LlamaFileType, fname_out: Path, is_big_endian: bool, use_temp_file: bool, eager: bool):
def __init__(self, dir_model: Path, ftype: gguf.LlamaFileType, fname_out: Path, is_big_endian: bool, use_temp_file: bool, eager: bool, model_name: str | None):
if type(self) is Model:
raise TypeError(f"{type(self).__name__!r} should not be directly instantiated")
self.dir_model = dir_model
@@ -72,10 +74,11 @@ class Model:
self.endianess = gguf.GGUFEndian.BIG if is_big_endian else gguf.GGUFEndian.LITTLE
self.use_temp_file = use_temp_file
self.lazy = not eager
self.part_names = Model.get_model_part_names(self.dir_model, ".safetensors")
self.model_name = model_name
self.part_names = Model.get_model_part_names(self.dir_model, "model", ".safetensors")
self.is_safetensors = len(self.part_names) > 0
if not self.is_safetensors:
self.part_names = Model.get_model_part_names(self.dir_model, ".bin")
self.part_names = Model.get_model_part_names(self.dir_model, "pytorch_model", ".bin")
self.hparams = Model.load_hparams(self.dir_model)
self.block_count = self.find_hparam(["n_layers", "num_hidden_layers", "n_layer"])
self.tensor_map = gguf.get_tensor_name_map(self.model_arch, self.block_count)
@@ -93,7 +96,7 @@ class Model:
ftype_lw: str = ftype_up.lower()
# allow templating the file name with the output ftype, useful with the "auto" ftype
self.fname_out = fname_out.parent / fname_out.name.format(ftype_lw, outtype=ftype_lw, ftype=ftype_lw, OUTTYPE=ftype_up, FTYPE=ftype_up)
self.gguf_writer = gguf.GGUFWriter(self.fname_out, gguf.MODEL_ARCH_NAMES[self.model_arch], endianess=self.endianess, use_temp_file=self.use_temp_file)
self.gguf_writer = gguf.GGUFWriter(path=None, arch=gguf.MODEL_ARCH_NAMES[self.model_arch], endianess=self.endianess, use_temp_file=self.use_temp_file)
@classmethod
def __init_subclass__(cls):
@@ -181,7 +184,7 @@ class Model:
return new_name
def set_gguf_parameters(self):
self.gguf_writer.add_name(self.dir_model.name)
self.gguf_writer.add_name(self.dir_model.name if self.model_name is None else self.model_name)
self.gguf_writer.add_block_count(self.block_count)
if (n_ctx := self.find_hparam(["max_position_embeddings", "n_ctx"], optional=True)) is not None:
@@ -323,21 +326,21 @@ class Model:
def write(self):
self.write_tensors()
self.gguf_writer.write_header_to_file()
self.gguf_writer.write_header_to_file(self.fname_out)
self.gguf_writer.write_kv_data_to_file()
self.gguf_writer.write_tensors_to_file(progress=True)
self.gguf_writer.close()
def write_vocab(self):
self.gguf_writer.write_header_to_file()
self.gguf_writer.write_header_to_file(self.fname_out)
self.gguf_writer.write_kv_data_to_file()
self.gguf_writer.close()
@staticmethod
def get_model_part_names(dir_model: Path, suffix: str) -> list[str]:
def get_model_part_names(dir_model: Path, prefix: str, suffix: str) -> list[str]:
part_names: list[str] = []
for filename in os.listdir(dir_model):
if filename.endswith(suffix):
if filename.startswith(prefix) and filename.endswith(suffix):
part_names.append(filename)
part_names.sort()
@@ -474,6 +477,9 @@ class Model:
if chkhsh == "c136ed14d01c2745d4f60a9596ae66800e2b61fa45643e72436041855ad4089d":
# ref: https://huggingface.co/abacusai/Smaug-Llama-3-70B-Instruct
res = "smaug-bpe"
if chkhsh == "7967bfa498ade6b757b064f31e964dddbb80f8f9a4d68d4ba7998fcf281c531a":
# ref: https://huggingface.co/jinaai/jina-embeddings-v2-base-code
res = "jina-v2-code"
if res is None:
logger.warning("\n")
@@ -661,7 +667,7 @@ class GPTNeoXModel(Model):
def set_gguf_parameters(self):
block_count = self.hparams["num_hidden_layers"]
self.gguf_writer.add_name(self.dir_model.name)
self.gguf_writer.add_name(self.dir_model.name if self.model_name is None else self.model_name)
self.gguf_writer.add_context_length(self.hparams["max_position_embeddings"])
self.gguf_writer.add_embedding_length(self.hparams["hidden_size"])
self.gguf_writer.add_block_count(block_count)
@@ -794,7 +800,7 @@ class MPTModel(Model):
def set_gguf_parameters(self):
block_count = self.hparams["n_layers"]
self.gguf_writer.add_name(self.dir_model.name)
self.gguf_writer.add_name(self.dir_model.name if self.model_name is None else self.model_name)
self.gguf_writer.add_context_length(self.hparams["max_seq_len"])
self.gguf_writer.add_embedding_length(self.hparams["d_model"])
self.gguf_writer.add_block_count(block_count)
@@ -846,7 +852,7 @@ class OrionModel(Model):
raise ValueError("gguf: can not find ctx length parameter.")
self.gguf_writer.add_file_type(self.ftype)
self.gguf_writer.add_name(self.dir_model.name)
self.gguf_writer.add_name(self.dir_model.name if self.model_name is None else self.model_name)
self.gguf_writer.add_source_hf_repo(hf_repo)
self.gguf_writer.add_tensor_data_layout("Meta AI original pth")
self.gguf_writer.add_context_length(ctx_length)
@@ -883,7 +889,7 @@ class BaichuanModel(Model):
else:
raise ValueError("gguf: can not find ctx length parameter.")
self.gguf_writer.add_name(self.dir_model.name)
self.gguf_writer.add_name(self.dir_model.name if self.model_name is None else self.model_name)
self.gguf_writer.add_source_hf_repo(hf_repo)
self.gguf_writer.add_tensor_data_layout("Meta AI original pth")
self.gguf_writer.add_context_length(ctx_length)
@@ -1006,7 +1012,7 @@ class XverseModel(Model):
else:
raise ValueError("gguf: can not find ctx length parameter.")
self.gguf_writer.add_name(self.dir_model.name)
self.gguf_writer.add_name(self.dir_model.name if self.model_name is None else self.model_name)
self.gguf_writer.add_source_hf_repo(hf_repo)
self.gguf_writer.add_tensor_data_layout("Meta AI original pth")
self.gguf_writer.add_context_length(ctx_length)
@@ -1202,7 +1208,7 @@ class StableLMModel(Model):
hparams = self.hparams
block_count = hparams["num_hidden_layers"]
self.gguf_writer.add_name(self.dir_model.name)
self.gguf_writer.add_name(self.dir_model.name if self.model_name is None else self.model_name)
self.gguf_writer.add_context_length(hparams["max_position_embeddings"])
self.gguf_writer.add_embedding_length(hparams["hidden_size"])
self.gguf_writer.add_block_count(block_count)
@@ -1677,7 +1683,7 @@ class GPT2Model(Model):
model_arch = gguf.MODEL_ARCH.GPT2
def set_gguf_parameters(self):
self.gguf_writer.add_name(self.dir_model.name)
self.gguf_writer.add_name(self.dir_model.name if self.model_name is None else self.model_name)
self.gguf_writer.add_block_count(self.hparams["n_layer"])
self.gguf_writer.add_context_length(self.hparams["n_ctx"])
self.gguf_writer.add_embedding_length(self.hparams["n_embd"])
@@ -2244,7 +2250,7 @@ class GemmaModel(Model):
hparams = self.hparams
block_count = hparams["num_hidden_layers"]
self.gguf_writer.add_name(self.dir_model.name)
self.gguf_writer.add_name(self.dir_model.name if self.model_name is None else self.model_name)
self.gguf_writer.add_context_length(hparams["max_position_embeddings"])
self.gguf_writer.add_embedding_length(hparams["hidden_size"])
self.gguf_writer.add_block_count(block_count)
@@ -2344,7 +2350,7 @@ class MambaModel(Model):
# Fail early for models which don't have a block expansion factor of 2
assert d_inner == 2 * d_model
self.gguf_writer.add_name(self.dir_model.name)
self.gguf_writer.add_name(self.dir_model.name if self.model_name is None else self.model_name)
self.gguf_writer.add_context_length(2**20) # arbitrary value; for those who use the default
self.gguf_writer.add_embedding_length(d_model)
self.gguf_writer.add_feed_forward_length(0) # unused, but seemingly required when loading
@@ -2451,11 +2457,13 @@ class JinaBertV2Model(BertModel):
def get_tensors(self):
for name, data in super().get_tensors():
if 'gated_layers' in name:
if 'gated_layer' in name:
d1 = data[:self.intermediate_size, :]
name1 = name.replace('gated_layers', 'gated_layers_w')
name1 = name1.replace('up_gated_layer', 'gated_layers_v')
d2 = data[self.intermediate_size:, :]
name2 = name.replace('gated_layers', 'gated_layers_v')
name2 = name2.replace('up_gated_layer', 'gated_layers_w')
yield name1, d1
yield name2, d2
continue
@@ -2840,8 +2848,13 @@ def main() -> None:
hparams = Model.load_hparams(dir_model)
with torch.inference_mode():
model_class = Model.from_model_architecture(hparams["architectures"][0])
model_instance = model_class(dir_model, ftype_map[args.outtype], fname_out, args.bigendian, args.use_temp_file, args.no_lazy)
try:
model_class = Model.from_model_architecture(hparams["architectures"][0])
except NotImplementedError:
logger.error(f"Model {hparams['architectures'][0]} is not supported")
sys.exit(1)
model_instance = model_class(dir_model, ftype_map[args.outtype], fname_out, args.bigendian, args.use_temp_file, args.no_lazy, args.model_name)
logger.info("Set model parameters")
model_instance.set_gguf_parameters()

View File

@@ -15,7 +15,6 @@ else()
add_subdirectory(baby-llama)
add_subdirectory(batched)
add_subdirectory(batched-bench)
add_subdirectory(beam-search)
add_subdirectory(benchmark)
add_subdirectory(convert-llama2c-to-ggml)
add_subdirectory(embedding)

View File

@@ -1,19 +0,0 @@
#!/bin/bash
#
# Temporary script - will be removed in the future
#
cd `dirname $0`
cd ..
./main -m ./models/alpaca.13b.ggmlv3.q8_0.bin \
--color \
-f ./prompts/alpaca.txt \
--ctx_size 2048 \
-n -1 \
-ins -b 256 \
--top_k 10000 \
--temp 0.2 \
--repeat_penalty 1.1 \
-t 7

View File

@@ -522,8 +522,8 @@ static struct ggml_tensor * forward(
// wk shape [n_embd, n_embd, 1, 1]
// Qcur shape [n_embd/n_head, n_head, N, 1]
// Kcur shape [n_embd/n_head, n_head, N, 1]
struct ggml_tensor * Qcur = ggml_rope(ctx0, ggml_reshape_3d(ctx0, ggml_mul_mat(ctx0, model->layers[il].wq, cur), n_embd/n_head, n_head, N), KQ_pos, n_rot, 0, 0);
struct ggml_tensor * Kcur = ggml_rope(ctx0, ggml_reshape_3d(ctx0, ggml_mul_mat(ctx0, model->layers[il].wk, cur), n_embd/n_head, n_head, N), KQ_pos, n_rot, 0, 0);
struct ggml_tensor * Qcur = ggml_rope(ctx0, ggml_reshape_3d(ctx0, ggml_mul_mat(ctx0, model->layers[il].wq, cur), n_embd/n_head, n_head, N), KQ_pos, n_rot, 0);
struct ggml_tensor * Kcur = ggml_rope(ctx0, ggml_reshape_3d(ctx0, ggml_mul_mat(ctx0, model->layers[il].wk, cur), n_embd/n_head, n_head, N), KQ_pos, n_rot, 0);
// store key and value to memory
{
@@ -759,8 +759,8 @@ static struct ggml_tensor * forward_batch(
// wk shape [n_embd, n_embd, 1, 1]
// Qcur shape [n_embd/n_head, n_head, N, n_batch]
// Kcur shape [n_embd/n_head, n_head, N, n_batch]
struct ggml_tensor * Qcur = ggml_rope(ctx0, ggml_reshape_4d(ctx0, ggml_mul_mat(ctx0, model->layers[il].wq, cur), n_embd/n_head, n_head, N, n_batch), KQ_pos, n_rot, 0, 0);
struct ggml_tensor * Kcur = ggml_rope(ctx0, ggml_reshape_4d(ctx0, ggml_mul_mat(ctx0, model->layers[il].wk, cur), n_embd/n_head, n_head, N, n_batch), KQ_pos, n_rot, 0, 0);
struct ggml_tensor * Qcur = ggml_rope(ctx0, ggml_reshape_4d(ctx0, ggml_mul_mat(ctx0, model->layers[il].wq, cur), n_embd/n_head, n_head, N, n_batch), KQ_pos, n_rot, 0);
struct ggml_tensor * Kcur = ggml_rope(ctx0, ggml_reshape_4d(ctx0, ggml_mul_mat(ctx0, model->layers[il].wk, cur), n_embd/n_head, n_head, N, n_batch), KQ_pos, n_rot, 0);
assert_shape_4d(Qcur, n_embd/n_head, n_head, N, n_batch);
assert_shape_4d(Kcur, n_embd/n_head, n_head, N, n_batch);
@@ -1056,7 +1056,7 @@ static struct ggml_tensor * forward_lora(
model->layers[il].wqb,
cur)),
n_embd/n_head, n_head, N),
KQ_pos, n_rot, 0, 0);
KQ_pos, n_rot, 0);
struct ggml_tensor * Kcur = ggml_rope(ctx0,
ggml_reshape_3d(ctx0,
ggml_mul_mat(ctx0,
@@ -1065,7 +1065,7 @@ static struct ggml_tensor * forward_lora(
model->layers[il].wkb,
cur)),
n_embd/n_head, n_head, N),
KQ_pos, n_rot, 0, 0);
KQ_pos, n_rot, 0);
// store key and value to memory
{

View File

@@ -10,16 +10,16 @@ There are 2 modes of operation:
- `prompt is shared` - there is a common prompt of size `PP` used by all batches (i.e. `N_KV = PP + B*TG`)
```bash
./batched-bench MODEL_PATH [N_KV_MAX] [N_BATCH] [N_UBATCH] [IS_PP_SHARED] [NGL] [MMQ] <PP> <TG> <PL>
./batched-bench -m model.gguf -c 2048 -b 2048 -ub 512 -npp 128,256,512 -ntg 128,256 -npl 1,2,4,8,16,32 [-pps]
# LLaMA 7B, F16, N_KV_MAX = 16384 (8GB), prompt not shared
./batched-bench ./models/llama-7b/ggml-model-f16.gguf 16384 2048 512 0 99
./batched-bench -m ./models/llama-7b/ggml-model-f16.gguf -c 16384 -b 2048 -ub 512 -ngl 99
# LLaMA 7B, Q8_0, N_KV_MAX = 16384 (8GB), prompt is shared
./batched-bench ./models/llama-7b/ggml-model-q8_0.gguf 16384 2048 512 1 99
./batched-bench -m ./models/llama-7b/ggml-model-q8_0.gguf -c 16384 -b 2048 -ub 512 -ngl 99 -pps
# custom set of batches
./batched-bench ./models/llama-7b/ggml-model-q8_0.gguf 2048 512 512 0 999 0 128,256,512 128,256 1,2,4,8,16,32
./batched-bench -m ./models/llama-7b/ggml-model-q8_0.gguf -c 2048 -b 512 -ub 512 -ngl 999 -npp 128,256,512 -ntg 128,256 -npl 1,2,4,8,16,32
```
## Sample results

View File

@@ -28,67 +28,27 @@ static std::vector<int> parse_list(char * p) {
return ret;
}
static void print_usage(int argc, char ** argv, const gpt_params & params) {
gpt_params_print_usage(argc, argv, params);
LOG_TEE("\nexample usage:\n");
LOG_TEE("\n %s -m model.gguf -c 2048 -b 2048 -ub 512 -npp 128,256,512 -ntg 128,256 -npl 1,2,4,8,16,32 [-pps]\n", argv[0]);
LOG_TEE("\n");
}
int main(int argc, char ** argv) {
gpt_params params;
if (argc == 1 || argv[1][0] == '-') {
printf("usage: %s MODEL_PATH [N_KV_MAX] [N_BATCH] [N_UBATCH] [FATTN] [IS_PP_SHARED] [NGL] <PP> <TG> <PL>\n" , argv[0]);
printf(" <PP>, <TG> and PL are comma-separated lists of numbers without spaces\n\n");
printf(" example: %s ggml-model-f16.gguf 2048 2048 512 0 999 128,256,512 128,256 1,2,4,8,16,32\n\n", argv[0]);
return 1 ;
if (!gpt_params_parse(argc, argv, params)) {
print_usage(argc, argv, params);
return 1;
}
int n_kv_max = 2048;
int n_batch = 2048;
int n_ubatch = 512;
bool flash_attn = false;
int is_pp_shared = 0;
int n_gpu_layers = 0;
int is_pp_shared = params.is_pp_shared;
std::vector<int> n_pp = { 128, 256, 512, 1024, 2048, 3584, 7680, };
std::vector<int> n_tg = { 128, 256, };
std::vector<int> n_pl = { 1, 2, 4, 8, 16, 32, };
//std::vector<int> n_pl = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 32, };
if (argc >= 2) {
params.model = argv[1];
}
if (argc >= 3) {
n_kv_max = std::atoi(argv[2]);
}
if (argc >= 4) {
n_batch = std::atoi(argv[3]);
}
if (argc >= 5) {
n_ubatch = std::atoi(argv[4]);
}
if (argc >= 6) {
flash_attn = std::atoi(argv[5]);
}
if (argc >= 7) {
is_pp_shared = std::atoi(argv[6]);
}
if (argc >= 8) {
n_gpu_layers = std::atoi(argv[7]);
}
if (argc >= 9) {
n_pp = parse_list(argv[8]);
}
if (argc >= 10) {
n_tg = parse_list(argv[9]);
}
if (argc >= 11) {
n_pl = parse_list(argv[10]);
}
std::vector<int> n_pp = params.n_pp;
std::vector<int> n_tg = params.n_tg;
std::vector<int> n_pl = params.n_pl;
// init LLM
@@ -97,12 +57,7 @@ int main(int argc, char ** argv) {
// initialize the model
llama_model_params model_params = llama_model_default_params();
const std::vector<float> t_split(llama_max_devices(), 0.0f);
model_params.n_gpu_layers = n_gpu_layers;
model_params.tensor_split = t_split.data();
llama_model_params model_params = llama_model_params_from_gpt_params(params);
llama_model * model = llama_load_model_from_file(params.model.c_str(), model_params);
@@ -111,16 +66,7 @@ int main(int argc, char ** argv) {
return 1;
}
llama_context_params ctx_params = llama_context_default_params();
ctx_params.seed = 1234;
ctx_params.n_ctx = n_kv_max;
ctx_params.n_batch = n_batch;
ctx_params.n_ubatch = n_ubatch;
ctx_params.flash_attn = flash_attn;
ctx_params.n_threads = params.n_threads;
ctx_params.n_threads_batch = params.n_threads_batch == -1 ? params.n_threads : params.n_threads_batch;
llama_context_params ctx_params = llama_context_params_from_gpt_params(params);
// ensure enough sequences are available
ctx_params.n_seq_max = *std::max_element(n_pl.begin(), n_pl.end());
@@ -132,6 +78,8 @@ int main(int argc, char ** argv) {
return 1;
}
const int32_t n_kv_max = llama_n_ctx(ctx);
llama_batch batch = llama_batch_init(n_kv_max, 0, 1);
// decode in batches of ctx_params.n_batch tokens
@@ -175,7 +123,7 @@ int main(int argc, char ** argv) {
}
LOG_TEE("\n");
LOG_TEE("%s: n_kv_max = %d, n_batch = %d, n_ubatch = %d, flash_attn = %d, is_pp_shared = %d, n_gpu_layers = %d, n_threads = %u, n_threads_batch = %u\n", __func__, n_kv_max, n_batch, n_ubatch, flash_attn, is_pp_shared, n_gpu_layers, ctx_params.n_threads, ctx_params.n_threads_batch);
LOG_TEE("%s: n_kv_max = %d, n_batch = %d, n_ubatch = %d, flash_attn = %d, is_pp_shared = %d, n_gpu_layers = %d, n_threads = %u, n_threads_batch = %u\n", __func__, n_kv_max, params.n_batch, params.n_ubatch, params.flash_attn, params.is_pp_shared, params.n_gpu_layers, ctx_params.n_threads, ctx_params.n_threads_batch);
LOG_TEE("\n");
LOG_TEE("|%6s | %6s | %4s | %6s | %8s | %8s | %8s | %8s | %8s | %8s |\n", "PP", "TG", "B", "N_KV", "T_PP s", "S_PP t/s", "T_TG s", "S_TG t/s", "T s", "S t/s");

View File

@@ -3,7 +3,7 @@
The example demonstrates batched generation from a given prompt
```bash
./batched ./models/llama-7b-v2/ggml-model-f16.gguf "Hello my name is" 4
./batched -m ./models/llama-7b-v2/ggml-model-f16.gguf -p "Hello my name is" -np 4
...

View File

@@ -7,48 +7,31 @@
#include <string>
#include <vector>
static void print_usage(int argc, char ** argv, const gpt_params & params) {
gpt_params_print_usage(argc, argv, params);
LOG_TEE("\nexample usage:\n");
LOG_TEE("\n %s -m model.gguf -p \"Hello my name is\" -n 32 -np 4\n", argv[0]);
LOG_TEE("\n");
}
int main(int argc, char ** argv) {
gpt_params params;
if (argc == 1 || argv[1][0] == '-') {
printf("usage: %s MODEL_PATH [PROMPT] [PARALLEL] [LEN] [NGL]\n" , argv[0]);
return 1 ;
params.prompt = "Hello my name is";
params.n_predict = 32;
if (!gpt_params_parse(argc, argv, params)) {
print_usage(argc, argv, params);
return 1;
}
// number of parallel batches
int n_parallel = 1;
int n_parallel = params.n_parallel;
// total length of the sequences including the prompt
int n_len = 32;
// number of layers to offload to the GPU
int n_gpu_layers = 0;
if (argc >= 2) {
params.model = argv[1];
}
if (argc >= 3) {
params.prompt = argv[2];
}
if (argc >= 4) {
n_parallel = std::atoi(argv[3]);
}
if (argc >= 5) {
n_len = std::atoi(argv[4]);
}
if (argc >= 6) {
n_gpu_layers = std::atoi(argv[5]);
}
if (params.prompt.empty()) {
params.prompt = "Hello my name is";
}
string_process_escapes(params.prompt);
int n_predict = 32;
// init LLM
@@ -57,9 +40,7 @@ int main(int argc, char ** argv) {
// initialize the model
llama_model_params model_params = llama_model_default_params();
model_params.n_gpu_layers = n_gpu_layers;
llama_model_params model_params = llama_model_params_from_gpt_params(params);
llama_model * model = llama_load_model_from_file(params.model.c_str(), model_params);
@@ -73,18 +54,14 @@ int main(int argc, char ** argv) {
std::vector<llama_token> tokens_list;
tokens_list = ::llama_tokenize(model, params.prompt, true);
const int n_kv_req = tokens_list.size() + (n_len - tokens_list.size())*n_parallel;
const int n_kv_req = tokens_list.size() + (n_predict - tokens_list.size())*n_parallel;
// initialize the context
llama_context_params ctx_params = llama_context_default_params();
llama_context_params ctx_params = llama_context_params_from_gpt_params(params);
ctx_params.seed = 1234;
ctx_params.n_ctx = n_kv_req;
ctx_params.n_batch = std::max(n_len, n_parallel);
ctx_params.n_seq_max = n_parallel;
ctx_params.n_threads = params.n_threads;
ctx_params.n_threads_batch = params.n_threads_batch == -1 ? params.n_threads : params.n_threads_batch;
ctx_params.n_batch = std::max(n_predict, n_parallel);
llama_context * ctx = llama_new_context_with_model(model, ctx_params);
@@ -93,9 +70,9 @@ int main(int argc, char ** argv) {
return 1;
}
const int n_ctx = llama_n_ctx(ctx);
const int n_ctx = llama_n_ctx(ctx);
LOG_TEE("\n%s: n_len = %d, n_ctx = %d, n_batch = %u, n_parallel = %d, n_kv_req = %d\n", __func__, n_len, n_ctx, ctx_params.n_batch, n_parallel, n_kv_req);
LOG_TEE("\n%s: n_predict = %d, n_ctx = %d, n_batch = %u, n_parallel = %d, n_kv_req = %d\n", __func__, n_predict, n_ctx, ctx_params.n_batch, n_parallel, n_kv_req);
// make sure the KV cache is big enough to hold all the prompt and generated tokens
if (n_kv_req > n_ctx) {
@@ -156,7 +133,7 @@ int main(int argc, char ** argv) {
const auto t_main_start = ggml_time_us();
while (n_cur <= n_len) {
while (n_cur <= n_predict) {
// prepare the next batch
llama_batch_clear(batch);
@@ -192,7 +169,7 @@ int main(int argc, char ** argv) {
//const llama_token new_token_id = llama_sample_token_greedy(ctx, &candidates_p);
// is it an end of generation? -> mark the stream as finished
if (llama_token_is_eog(model, new_token_id) || n_cur == n_len) {
if (llama_token_is_eog(model, new_token_id) || n_cur == n_predict) {
i_batch[i] = -1;
LOG_TEE("\n");
if (n_parallel > 1) {

View File

@@ -1,5 +0,0 @@
set(TARGET beam-search)
add_executable(${TARGET} beam-search.cpp)
install(TARGETS ${TARGET} RUNTIME)
target_link_libraries(${TARGET} PRIVATE common llama ${CMAKE_THREAD_LIBS_INIT})
target_compile_features(${TARGET} PRIVATE cxx_std_11)

View File

@@ -1,188 +0,0 @@
#include "common.h"
#include "llama.h"
#include <cassert>
#include <cinttypes>
#include <cmath>
#include <cstdio>
#include <cstring>
#include <ctime>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>
#if defined (__unix__) || (defined (__APPLE__) && defined (__MACH__))
#include <signal.h>
#include <unistd.h>
#elif defined (_WIN32)
#define WIN32_LEAN_AND_MEAN
#ifndef NOMINMAX
# define NOMINMAX
#endif
#include <windows.h>
#include <signal.h>
#endif
// Used for debugging to print out beam tokens.
struct ostream_beam_view {
llama_context * ctx;
llama_beam_view beam_view;
};
static std::ostream & operator<<(std::ostream & os, const ostream_beam_view & obv) {
os << "p(" << obv.beam_view.p << ") eob(" << std::boolalpha << obv.beam_view.eob << ") tokens(";
for (size_t i = 0 ; i < obv.beam_view.n_tokens ; ++i) {
os << llama_token_to_piece(obv.ctx, obv.beam_view.tokens[i]);
}
return os << ')';
}
// Put here anything you want back in beam_search_callback().
struct beam_search_callback_data {
llama_context * ctx;
std::vector<llama_token> response;
};
// In this case, end-of-beam (eob) is equivalent to end-of-sentence (eos) but this need not always be the same.
// For example, eob can be flagged due to maximum token length, stop words, etc.
static bool is_at_eob(const beam_search_callback_data & callback_data, const llama_token * tokens, size_t n_tokens) {
return n_tokens && llama_token_is_eog(llama_get_model(callback_data.ctx), tokens[n_tokens-1]);
}
// Function matching type llama_beam_search_callback_fn_t.
// Custom callback example is called each time the beams lengths increase:
// * Show progress by printing ',' following by number of convergent beam tokens if any.
// * When all beams converge to a common prefix, they are made available in beams_state.beams[0].
// This is also called when the stop condition is met.
// Collect tokens into std::vector<llama_token> response which is pointed to by callback_data.
static void beam_search_callback(void * callback_data_ptr, llama_beams_state beams_state) {
auto& callback_data = *static_cast<beam_search_callback_data*>(callback_data_ptr);
// Mark beams as EOS as needed.
for (size_t i = 0 ; i < beams_state.n_beams ; ++i) {
llama_beam_view& beam_view = beams_state.beam_views[i];
if (!beam_view.eob && is_at_eob(callback_data, beam_view.tokens, beam_view.n_tokens)) {
beam_view.eob = true;
}
}
printf(","); // Show progress
if (const size_t n = beams_state.common_prefix_length) {
callback_data.response.resize(callback_data.response.size() + n);
assert(0u < beams_state.n_beams);
const llama_token * tokens = beams_state.beam_views[0].tokens;
std::copy(tokens, tokens + n, callback_data.response.end() - n);
printf("%zu", n);
}
fflush(stdout);
#if 1 // DEBUG: print current beams for this iteration
std::cout << "\n\nCurrent beams (last_call=" << beams_state.last_call << "):\n";
for (size_t i = 0 ; i < beams_state.n_beams ; ++i) {
std::cout << "beams["<<i<<"]: " << ostream_beam_view{callback_data.ctx,beams_state.beam_views[i]} << std::endl;
}
#endif
}
int main(int argc, char ** argv)
{
gpt_params params;
//params.n_gpu_layers = 200;
//---------------------------------
// Print help :
//---------------------------------
if ( argc < 2 || argv[1][0] == '-' )
{
printf( "Usage: %s MODEL_PATH [BEAM_WIDTH=2] [PROMPT]\n" , argv[0] );
return 1 ;
}
//---------------------------------
// Load parameters :
//---------------------------------
params.model = argv[1];
params.n_beams = 2 < argc ? std::stoi(argv[2]) : 2;
if ( argc > 3 )
{
params.prompt = argv[3];
}
if ( params.prompt.empty() )
{
params.prompt = "### Request:\nHow many countries are there?\n\n### Response:\n";
}
//---------------------------------
// Init LLM :
//---------------------------------
llama_backend_init();
llama_numa_init(params.numa);
llama_model * model;
llama_context * ctx;
std::tie(model, ctx) = llama_init_from_gpt_params( params );
if ( model == NULL )
{
fprintf( stderr , "%s: error: unable to load model\n" , __func__ );
return 1;
}
//---------------------------------
// Tokenize the prompt :
//---------------------------------
std::vector<llama_token> tokens_list = llama_tokenize(ctx, params.prompt, true);
const size_t max_context_size = llama_n_ctx( ctx );
const size_t max_tokens_list_size = max_context_size - 4 ;
if (tokens_list.size() > max_tokens_list_size)
{
fprintf( stderr , "%s: error: prompt too long (%zu tokens, max %zu)\n" ,
__func__ , tokens_list.size() , max_tokens_list_size );
return 1;
}
fprintf( stderr, "\n\n" );
// Print the tokens from the prompt :
for( auto id : tokens_list )
{
std::cout << llama_token_to_piece(ctx, id);
}
std::cout << std::flush;
int n_past = 0;
if (llama_decode(ctx, llama_batch_get_one(tokens_list.data(), tokens_list.size(), n_past, 0)))
{
fprintf(stderr, "%s : failed to eval prompt.\n" , __func__ );
return 1;
}
n_past += tokens_list.size();
beam_search_callback_data callback_data{ctx, {}};
size_t const beam_width = static_cast<size_t>(params.n_beams);
int const n_predict = 256;
llama_beam_search(ctx, beam_search_callback, &callback_data, beam_width, n_past, n_predict);
std::cout << "\n\n";
for (llama_token const token_id : callback_data.response) {
std::cout << llama_token_to_piece(ctx,token_id);
}
std::cout << std::endl;
llama_free( ctx );
llama_free_model( model );
llama_backend_free();
return 0;
}

View File

@@ -176,7 +176,7 @@ class Params:
rope_scaling_type: gguf.RopeScalingType | None = None
f_rope_freq_base: float | None = None
f_rope_scale: float | None = None
n_orig_ctx: int | None = None
n_ctx_orig: int | None = None
rope_finetuned: bool | None = None
ftype: GGMLFileType | None = None
@@ -226,7 +226,7 @@ class Params:
with open(config_path) as f:
config = json.load(f)
rope_scaling_type = f_rope_scale = n_orig_ctx = rope_finetuned = None
rope_scaling_type = f_rope_scale = n_ctx_orig = rope_finetuned = None
rope_scaling = config.get("rope_scaling")
if rope_scaling is not None and (typ := rope_scaling.get("type")):
@@ -236,7 +236,7 @@ class Params:
rope_scaling_type = gguf.RopeScalingType.LINEAR
elif typ == "yarn":
rope_scaling_type = gguf.RopeScalingType.YARN
n_orig_ctx = rope_scaling['original_max_position_embeddings']
n_ctx_orig = rope_scaling['original_max_position_embeddings']
rope_finetuned = rope_scaling['finetuned']
else:
raise NotImplementedError(f'Unknown rope scaling type: {typ}')
@@ -272,7 +272,7 @@ class Params:
f_rope_freq_base = config.get("rope_theta"),
rope_scaling_type = rope_scaling_type,
f_rope_scale = f_rope_scale,
n_orig_ctx = n_orig_ctx,
n_ctx_orig = n_ctx_orig,
rope_finetuned = rope_finetuned,
)
@@ -864,8 +864,8 @@ class OutputFile:
self.gguf.add_rope_scaling_type(params.rope_scaling_type)
self.gguf.add_rope_scaling_factor(params.f_rope_scale)
if params.n_orig_ctx is not None:
self.gguf.add_rope_scaling_orig_ctx_len(params.n_orig_ctx)
if params.n_ctx_orig is not None:
self.gguf.add_rope_scaling_orig_ctx_len(params.n_ctx_orig)
if params.rope_finetuned is not None:
self.gguf.add_rope_scaling_finetuned(params.rope_finetuned)

View File

@@ -63,6 +63,7 @@ int main(int argc, char ** argv) {
gpt_params params;
if (!gpt_params_parse(argc, argv, params)) {
gpt_params_print_usage(argc, argv, params);
return 1;
}
@@ -79,9 +80,6 @@ int main(int argc, char ** argv) {
fprintf(stderr, "%s: seed = %u\n", __func__, params.seed);
std::mt19937 rng(params.seed);
if (params.random_prompt) {
params.prompt = string_random_prompt(rng);
}
llama_backend_init();
llama_numa_init(params.numa);

View File

@@ -140,20 +140,18 @@ static bool run(llama_context * ctx, const gpt_params & params) {
}
int main(int argc, char ** argv) {
callback_data cb_data;
gpt_params params;
if (!gpt_params_parse(argc, argv, params)) {
gpt_params_print_usage(argc, argv, params);
return 1;
}
print_build_info();
std::mt19937 rng(params.seed);
if (params.random_prompt) {
params.prompt = string_random_prompt(rng);
}
llama_backend_init();
llama_numa_init(params.numa);

View File

@@ -564,7 +564,7 @@ static struct ggml_tensor * llama_build_lora_finetune_graphs(
const int rope_mode = 0;
return ggml_rope_ext(ctx,
t, KQ_pos, nullptr, n_rot, rope_mode, n_ctx, 0,
t, KQ_pos, nullptr, n_rot, rope_mode, n_ctx,
rope_freq_base, rope_freq_scale, 0.0f, 1.0f, 0.0f, 0.0f
);
};

View File

@@ -61,10 +61,10 @@ static size_t split_str_to_n_bytes(std::string str) {
int n;
if (str.back() == 'M') {
sscanf(str.c_str(), "%d", &n);
n_bytes = (size_t)n * 1024 * 1024; // megabytes
n_bytes = (size_t)n * 1000 * 1000; // megabytes
} else if (str.back() == 'G') {
sscanf(str.c_str(), "%d", &n);
n_bytes = (size_t)n * 1024 * 1024 * 1024; // gigabytes
n_bytes = (size_t)n * 1000 * 1000 * 1000; // gigabytes
} else {
throw std::invalid_argument("error: supported units are M (megabytes) or G (gigabytes), but got: " + std::string(1, str.back()));
}
@@ -284,7 +284,7 @@ struct split_strategy {
struct ggml_tensor * t = ggml_get_tensor(ctx_meta, gguf_get_tensor_name(ctx_out, i));
total_size += ggml_nbytes(t);
}
total_size = total_size / 1024 / 1024; // convert to megabytes
total_size = total_size / 1000 / 1000; // convert to megabytes
printf("split %05d: n_tensors = %d, total_size = %ldM\n", i_split + 1, gguf_get_n_tensors(ctx_out), total_size);
i_split++;
}

View File

@@ -41,7 +41,7 @@ echo PASS
echo
# 2b. Test the sharded model is loading properly
$MAIN --model $WORK_PATH/ggml-model-split-00001-of-00006.gguf --random-prompt --n-predict 32
$MAIN --model $WORK_PATH/ggml-model-split-00001-of-00006.gguf --n-predict 32
echo PASS
echo
@@ -51,7 +51,7 @@ echo PASS
echo
# 3b. Test the merged model is loading properly
$MAIN --model $WORK_PATH/ggml-model-merge.gguf --random-prompt --n-predict 32
$MAIN --model $WORK_PATH/ggml-model-merge.gguf --n-predict 32
echo PASS
echo
@@ -61,7 +61,7 @@ echo PASS
echo
# 4b. Test the sharded model is loading properly
$MAIN --model $WORK_PATH/ggml-model-split-32-tensors-00001-of-00007.gguf --random-prompt --n-predict 32
$MAIN --model $WORK_PATH/ggml-model-split-32-tensors-00001-of-00007.gguf --n-predict 32
echo PASS
echo
@@ -71,7 +71,7 @@ echo
#echo
# 5b. Test the merged model is loading properly
#$MAIN --model $WORK_PATH/ggml-model-merge-2.gguf --random-prompt --n-predict 32
#$MAIN --model $WORK_PATH/ggml-model-merge-2.gguf --n-predict 32
#echo PASS
#echo
@@ -81,7 +81,7 @@ echo PASS
echo
# 6b. Test the sharded model is loading properly
$MAIN --model $WORK_PATH/ggml-model-split-2G-00001-of-00002.gguf --random-prompt --n-predict 32
$MAIN --model $WORK_PATH/ggml-model-split-2G-00001-of-00002.gguf --n-predict 32
echo PASS
echo

View File

@@ -1,15 +0,0 @@
#!/bin/bash
#
# Temporary script - will be removed in the future
#
cd `dirname $0`
cd ..
./main --color --instruct --threads 4 \
--model ./models/gpt4all-7B/gpt4all-lora-quantized.bin \
--file ./prompts/alpaca.txt \
--batch_size 8 --ctx_size 2048 -n -1 \
--repeat_last_n 64 --repeat_penalty 1.3 \
--n_predict 128 --temp 0.1 --top_k 40 --top_p 0.95

View File

@@ -153,7 +153,9 @@ static std::string gritlm_instruction(const std::string & instruction) {
int main(int argc, char * argv[]) {
gpt_params params;
if (!gpt_params_parse(argc, argv, params)) {
gpt_params_print_usage(argc, argv, params);
return 1;
}

View File

@@ -6,16 +6,19 @@ More information is available here: https://github.com/ggerganov/llama.cpp/pull/
## Usage
```
./imatrix -m <some_fp_model> -f <some_training_data> [-o <output_file>] [--verbosity <verbosity_level>]
[-ofreq num_chunks] [-ow <0 or 1>] [other common params]
./imatrix \
-m model.gguf -f some-text.txt [-o imatrix.dat] [--process-output] [--verbosity 1] \
[--no-ppl] [--chunk 123] [--output-frequency 10] [--save-frequency 0] \
[--in-file imatrix-prev-0.dat --in-file imatrix-prev-1.dat ...]
```
Here `-m` with a model name and `-f` with a file containing training data (such as e.g. `wiki.train.raw`) are mandatory.
The parameters in square brackets are optional and have the following meaning:
* `-o` (or `--output-file`) specifies the name of the file where the computed data will be stored. If missing `imatrix.dat` is used.
* `--verbosity` specifies the verbosity level. If set to `0`, no output other than the perplexity of the processed chunks will be generated. If set to `1`, each time the results are saved a message is written to `stderr`. If `>=2`, a message is output each time data is collected for any tensor. Default verbosity level is `1`.
* `-ofreq` (or `--output-frequency`) specifies how often the so far computed result is saved to disk. Default is 10 (i.e., every 10 chunks)
* `-ow` (or `--output-weight`) specifies if data will be collected for the `output.weight` tensor. My experience is that it is better to not utilize the importance matrix when quantizing `output.weight`, so this is set to `false` by default.
* `--output-frequency` specifies how often the so far computed result is saved to disk. Default is 10 (i.e., every 10 chunks)
* `--save-frequency` specifies how often to save a copy of the imatrix in a separate file. Default is 0 (i.e., never)
* `--process-output` specifies if data will be collected for the `output.weight` tensor. My experience is that it is better to not utilize the importance matrix when quantizing `output.weight`, so this is set to `false` by default.
For faster computation, make sure to use GPU offloading via the `-ngl` argument

View File

@@ -17,39 +17,37 @@
#pragma warning(disable: 4244 4267) // possible loss of data
#endif
static void print_usage(int argc, char ** argv, const gpt_params & params) {
gpt_params_print_usage(argc, argv, params);
LOG_TEE("\nexample usage:\n");
LOG_TEE("\n %s \\\n"
" -m model.gguf -f some-text.txt [-o imatrix.dat] [--process-output] [--verbosity 1] \\\n"
" [--no-ppl] [--chunk 123] [--output-frequency 10] [--save-frequency 0] \\\n"
" [--in-file imatrix-prev-0.dat --in-file imatrix-prev-1.dat ...]\n" , argv[0]);
LOG_TEE("\n");
}
struct Stats {
std::vector<float> values;
std::vector<int> counts;
int ncall = 0;
};
struct StatParams {
std::string dataset;
std::string ofile = "imatrix.dat";
int n_output_frequency = 10;
int verbosity = 1;
int keep_every = 0;
bool collect_output_weight = false;
};
class IMatrixCollector {
public:
IMatrixCollector() = default;
void set_parameters(StatParams&& params) { m_params = std::move(params); }
void set_params(gpt_params params) { m_params = std::move(params); }
bool collect_imatrix(struct ggml_tensor * t, bool ask, void * user_data);
void save_imatrix() const;
bool load_imatrix(const char * file_name, bool add);
static bool load_imatrix(const char * file_name, std::unordered_map<std::string, Stats>& imatrix);
void save_imatrix(int ncall = -1) const;
bool load_imatrix(const char * file_name);
private:
std::unordered_map<std::string, Stats> m_stats;
StatParams m_params;
gpt_params m_params;
std::mutex m_mutex;
int m_last_call = 0;
std::vector<float> m_src1_data;
std::vector<char> m_ids; // the expert ids from ggml_mul_mat_id
//
void save_imatrix(const char * file_name, const char * dataset) const;
void keep_imatrix(int ncall) const;
};
// remove any prefix and suffixes from the name
@@ -85,7 +83,7 @@ bool IMatrixCollector::collect_imatrix(struct ggml_tensor * t, bool ask, void *
if (t->op != GGML_OP_MUL_MAT) return false;
// why are small batches ignored (<16 tokens)?
if (src1->ne[1] < 16 || src1->type != GGML_TYPE_F32) return false;
if (!(wname.substr(0, 4) == "blk." || (m_params.collect_output_weight && wname == "output.weight"))) return false;
if (!(wname.substr(0, 4) == "blk." || (m_params.process_output && wname == "output.weight"))) return false;
return true;
}
@@ -153,21 +151,25 @@ bool IMatrixCollector::collect_imatrix(struct ggml_tensor * t, bool ask, void *
for (int j = 0; j < (int)src1->ne[0]; ++j) {
e.values[e_start + j] += x[j]*x[j];
e.counts[e_start + j]++;
if (!std::isfinite(e.values[e_start + j])) {
fprintf(stderr, "%f detected in %s\n", e.values[e_start + j], wname.c_str());
exit(1);
}
}
}
}
if (e.ncall > m_last_call) {
m_last_call = e.ncall;
if (m_last_call % m_params.n_output_frequency == 0) {
if (m_last_call % m_params.n_out_freq == 0) {
save_imatrix();
}
if (m_params.keep_every > 0 && m_last_call%m_params.keep_every == 0) {
keep_imatrix(m_last_call);
if (m_params.n_save_freq > 0 && m_last_call%m_params.n_save_freq == 0) {
save_imatrix(m_last_call);
}
}
}
} else {
auto& e = m_stats[wname];
auto & e = m_stats[wname];
if (e.values.empty()) {
e.values.resize(src1->ne[0], 0);
e.counts.resize(src1->ne[0], 0);
@@ -185,15 +187,19 @@ bool IMatrixCollector::collect_imatrix(struct ggml_tensor * t, bool ask, void *
for (int j = 0; j < (int)src1->ne[0]; ++j) {
e.values[j] += x[j]*x[j];
e.counts[j]++;
if (!std::isfinite(e.values[j])) {
fprintf(stderr, "%f detected in %s\n", e.values[j], wname.c_str());
exit(1);
}
}
}
if (e.ncall > m_last_call) {
m_last_call = e.ncall;
if (m_last_call % m_params.n_output_frequency == 0) {
if (m_last_call % m_params.n_out_freq == 0) {
save_imatrix();
}
if (m_params.keep_every > 0 && m_last_call%m_params.keep_every == 0) {
keep_imatrix(m_last_call);
if (m_params.n_save_freq > 0 && m_last_call%m_params.n_save_freq == 0) {
save_imatrix(m_last_call);
}
}
}
@@ -201,33 +207,75 @@ bool IMatrixCollector::collect_imatrix(struct ggml_tensor * t, bool ask, void *
return true;
}
void IMatrixCollector::save_imatrix() const {
save_imatrix(m_params.ofile.empty() ? "imatrix.dat" : m_params.ofile.c_str(), m_params.dataset.c_str());
}
void IMatrixCollector::save_imatrix(int ncall) const {
auto fname = m_params.out_file;
if (fname.empty()) {
fname = "imatrix.dat";
}
void IMatrixCollector::keep_imatrix(int ncall) const {
auto file_name = m_params.ofile;
if (file_name.empty()) file_name = "imatrix.dat";
file_name += ".at_";
file_name += std::to_string(ncall);
save_imatrix(file_name.c_str(), m_params.dataset.c_str());
}
if (ncall > 0) {
fname += ".at_";
fname += std::to_string(ncall);
}
// avoid writing imatrix entries that do not have full data
// this can happen with MoE models where some of the experts end up not being exercised by the provided training data
int n_entries = 0;
std::vector<std::string> to_store;
bool is_first = true; // for printing
for (const auto & kv : m_stats) {
const int n_all = kv.second.counts.size();
if (n_all == 0) {
continue;
}
int n_zeros = 0;
for (const int c : kv.second.counts) {
if (c == 0) {
n_zeros++;
}
}
if (n_zeros != 0 && is_first) {
fprintf(stderr, "\n");
is_first = false;
}
if (n_zeros == n_all) {
fprintf(stderr, "%s: entry '%40s' has no data - skipping\n", __func__, kv.first.c_str());
continue;
}
if (n_zeros > 0) {
fprintf(stderr, "%s: entry '%40s' has partial data (%.2f%%) - skipping\n", __func__, kv.first.c_str(), 100.0f * (n_all - n_zeros) / n_all);
continue;
}
n_entries++;
to_store.push_back(kv.first);
}
if (to_store.size() < m_stats.size()) {
fprintf(stderr, "%s: warning: storing only %zu out of %zu entries\n", __func__, to_store.size(), m_stats.size());
}
void IMatrixCollector::save_imatrix(const char * fname, const char * dataset) const {
std::ofstream out(fname, std::ios::binary);
int n_entries = m_stats.size();
out.write((const char *) &n_entries, sizeof(n_entries));
for (const auto & p : m_stats) {
int len = p.first.size();
for (const auto & name : to_store) {
const auto & stat = m_stats.at(name);
int len = name.size();
out.write((const char *) &len, sizeof(len));
out.write(p.first.c_str(), len);
out.write((const char *) &p.second.ncall, sizeof(p.second.ncall));
int nval = p.second.values.size();
out.write(name.c_str(), len);
out.write((const char *) &stat.ncall, sizeof(stat.ncall));
int nval = stat.values.size();
out.write((const char *) &nval, sizeof(nval));
if (nval > 0) {
std::vector<float> tmp(nval);
for (int i = 0; i < nval; i++) {
tmp[i] = (p.second.values[i] / static_cast<float>(p.second.counts[i])) * static_cast<float>(p.second.ncall);
tmp[i] = (stat.values[i] / static_cast<float>(stat.counts[i])) * static_cast<float>(stat.ncall);
}
out.write((const char*)tmp.data(), nval*sizeof(float));
}
@@ -236,26 +284,28 @@ void IMatrixCollector::save_imatrix(const char * fname, const char * dataset) co
// Write the number of call the matrix was computed with
out.write((const char *) &m_last_call, sizeof(m_last_call));
// Write the dataset name at the end of the file to later on specify it in quantize
int n_dataset = strlen(dataset);
out.write((const char *) &n_dataset, sizeof(n_dataset));
out.write(dataset, n_dataset);
// Write the input filename at the end of the file to later on specify it in quantize
{
int len = m_params.prompt_file.size();
out.write((const char *) &len, sizeof(len));
out.write(m_params.prompt_file.c_str(), len);
}
if (m_params.verbosity > 0) {
fprintf(stderr, "\n%s: stored collected data after %d chunks in %s\n", __func__, m_last_call, fname);
fprintf(stderr, "\n%s: stored collected data after %d chunks in %s\n", __func__, m_last_call, fname.c_str());
}
}
bool IMatrixCollector::load_imatrix(const char * imatrix_file, std::unordered_map<std::string, Stats>& imatrix_data) {
std::ifstream in(imatrix_file, std::ios::binary);
bool IMatrixCollector::load_imatrix(const char * fname) {
std::ifstream in(fname, std::ios::binary);
if (!in) {
printf("%s: failed to open %s\n",__func__,imatrix_file);
printf("%s: failed to open %s\n",__func__, fname);
return false;
}
int n_entries;
in.read((char*)&n_entries, sizeof(n_entries));
if (in.fail() || n_entries < 1) {
printf("%s: no data in file %s\n", __func__, imatrix_file);
printf("%s: no data in file %s\n", __func__, fname);
return false;
}
for (int i = 0; i < n_entries; ++i) {
@@ -263,23 +313,22 @@ bool IMatrixCollector::load_imatrix(const char * imatrix_file, std::unordered_ma
std::vector<char> name_as_vec(len+1);
in.read((char *)name_as_vec.data(), len);
if (in.fail()) {
printf("%s: failed reading name for entry %d from %s\n",__func__,i+1,imatrix_file);
printf("%s: failed reading name for entry %d from %s\n",__func__,i+1, fname);
return false;
}
name_as_vec[len] = 0;
std::string name{name_as_vec.data()};
auto& e = imatrix_data[std::move(name)];
auto & e = m_stats[std::move(name)];
int ncall;
in.read((char*)&ncall, sizeof(ncall));
int nval;
in.read((char *)&nval, sizeof(nval));
if (in.fail() || nval < 1) {
printf("%s: failed reading number of values for entry %d\n",__func__,i);
imatrix_data = {};
m_stats = {};
return false;
}
// When re-called from load_imatrix() with add set, this will already be created.
if (e.values.empty()) {
e.values.resize(nval, 0);
e.counts.resize(nval, 0);
@@ -289,7 +338,7 @@ bool IMatrixCollector::load_imatrix(const char * imatrix_file, std::unordered_ma
in.read((char*)tmp.data(), nval*sizeof(float));
if (in.fail()) {
printf("%s: failed reading data for entry %d\n",__func__,i);
imatrix_data = {};
m_stats = {};
return false;
}
@@ -304,13 +353,6 @@ bool IMatrixCollector::load_imatrix(const char * imatrix_file, std::unordered_ma
return true;
}
bool IMatrixCollector::load_imatrix(const char * file_name, bool add) {
if (!add) {
m_stats.clear();
}
return load_imatrix(file_name, m_stats);
}
static IMatrixCollector g_collector;
static bool ik_collect_imatrix(struct ggml_tensor * t, bool ask, void * user_data) {
@@ -324,7 +366,7 @@ struct results_log_softmax {
float prob;
};
static std::vector<float> softmax(const std::vector<float>& logits) {
static std::vector<float> softmax(const std::vector<float> & logits) {
std::vector<float> probs(logits.size());
float max_logit = logits[0];
for (float v : logits) {
@@ -358,8 +400,7 @@ static results_log_softmax log_softmax(int n_vocab, const float * logits, int to
static void process_logits(
int n_vocab, const float * logits, const int * tokens, int n_token, std::vector<std::thread> & workers,
double & nll, double & nll2, float * logit_history, float * prob_history
) {
double & nll, double & nll2, float * logit_history, float * prob_history) {
std::mutex mutex;
int counter = 0;
auto compute = [&mutex, &counter, &nll, &nll2, logit_history, prob_history, n_vocab, logits, tokens, n_token] () {
@@ -391,8 +432,7 @@ static void process_logits(
}
}
static bool compute_imatrix(llama_context * ctx, const gpt_params & params, bool compute_ppl, int from_chunk) {
static bool compute_imatrix(llama_context * ctx, const gpt_params & params) {
const bool add_bos = llama_should_add_bos_token(llama_get_model(ctx));
GGML_ASSERT(llama_add_eos_token(llama_get_model(ctx)) != 1);
const int n_ctx = llama_n_ctx(ctx);
@@ -405,13 +445,13 @@ static bool compute_imatrix(llama_context * ctx, const gpt_params & params, bool
auto tim2 = std::chrono::high_resolution_clock::now();
fprintf(stderr, "%s: tokenization took %g ms\n",__func__,1e-3*std::chrono::duration_cast<std::chrono::microseconds>(tim2-tim1).count());
if (from_chunk > 0) {
if (size_t((from_chunk + 2)*n_ctx) >= tokens.size()) {
fprintf(stderr, "%s: there will be not enough tokens left after removing %d chunks\n", __func__, from_chunk);
if (params.i_chunk > 0) {
if (size_t((params.i_chunk + 2)*n_ctx) >= tokens.size()) {
fprintf(stderr, "%s: there will be not enough tokens left after removing %d chunks\n", __func__, params.i_chunk);
return false;
}
fprintf(stderr, "%s: removing initial %d chunks (%d tokens)\n", __func__, from_chunk, from_chunk*n_ctx);
tokens.erase(tokens.begin(), tokens.begin() + from_chunk*n_ctx);
fprintf(stderr, "%s: removing initial %d chunks (%d tokens)\n", __func__, params.i_chunk, params.i_chunk*n_ctx);
tokens.erase(tokens.begin(), tokens.begin() + params.i_chunk*n_ctx);
}
if (int(tokens.size()) < 2*n_ctx) {
@@ -424,7 +464,7 @@ static bool compute_imatrix(llama_context * ctx, const gpt_params & params, bool
std::vector<float> logit_history;
std::vector<float> prob_history;
if (compute_ppl) {
if (params.compute_ppl) {
logit_history.resize(tokens.size());
prob_history.resize(tokens.size());
}
@@ -446,7 +486,7 @@ static bool compute_imatrix(llama_context * ctx, const gpt_params & params, bool
const int num_batches = (n_ctx + n_batch - 1) / n_batch;
std::vector<float> logits;
if (compute_ppl && num_batches > 1) {
if (params.compute_ppl && num_batches > 1) {
logits.reserve((size_t)n_ctx * n_vocab);
}
@@ -482,7 +522,7 @@ static bool compute_imatrix(llama_context * ctx, const gpt_params & params, bool
// restore the original token in case it was set to BOS
tokens[batch_start] = token_org;
if (compute_ppl && num_batches > 1) {
if (params.compute_ppl && num_batches > 1) {
const auto * batch_logits = llama_get_logits(ctx);
logits.insert(logits.end(), batch_logits, batch_logits + batch_size * n_vocab);
}
@@ -501,7 +541,7 @@ static bool compute_imatrix(llama_context * ctx, const gpt_params & params, bool
fprintf(stderr, "%.2f minutes\n", total_seconds / 60.0);
}
if (compute_ppl) {
if (params.compute_ppl) {
const int first = n_ctx/2;
const auto all_logits = num_batches > 1 ? logits.data() : llama_get_logits(ctx);
process_logits(n_vocab, all_logits + first*n_vocab, tokens.data() + start + first, n_ctx - 1 - first,
@@ -516,7 +556,7 @@ static bool compute_imatrix(llama_context * ctx, const gpt_params & params, bool
}
printf("\n");
if (compute_ppl) {
if (params.compute_ppl) {
nll2 /= count;
nll /= count;
const double ppl = exp(nll);
@@ -533,111 +573,32 @@ static bool compute_imatrix(llama_context * ctx, const gpt_params & params, bool
}
int main(int argc, char ** argv) {
StatParams sparams;
std::string prev_result_file;
std::string combine_files;
bool compute_ppl = true;
int from_chunk = 0;
std::vector<char*> args;
args.push_back(argv[0]);
int iarg = 1;
for (; iarg < argc-1; ++iarg) {
std::string arg{argv[iarg]};
if (arg == "-o" || arg == "--output-file") {
sparams.ofile = argv[++iarg];
}
else if (arg == "-ofreq" || arg == "--output-frequency") {
sparams.n_output_frequency = std::stoi(argv[++iarg]);
}
else if (arg == "-ow" || arg == "--output-weight") {
sparams.collect_output_weight = std::stoi(argv[++iarg]);
}
else if (arg == "--verbosity") {
sparams.verbosity = std::stoi(argv[++iarg]);
} else if (arg == "--no-ppl") {
compute_ppl = false;
} else if (arg == "--keep-imatrix") {
sparams.keep_every = std::stoi(argv[++iarg]);
} else if (arg == "--continue-from") {
prev_result_file = argv[++iarg];
} else if (arg == "--combine") {
combine_files = argv[++iarg];
}
else if (arg == "--from-chunk") {
from_chunk = std::stoi(argv[++iarg]);
} else {
args.push_back(argv[iarg]);
}
}
if (iarg < argc) {
std::string arg{argv[iarg]};
if (arg == "--no-ppl") {
compute_ppl = false;
} else {
args.push_back(argv[iarg]);
}
}
gpt_params params;
params.n_batch = 512;
if (!gpt_params_parse(args.size(), args.data(), params)) {
params.n_ctx = 512;
params.logits_all = true;
params.verbosity = 1;
if (!gpt_params_parse(argc, argv, params)) {
print_usage(argc, argv, params);
return 1;
}
params.logits_all = true;
params.n_batch = std::min(params.n_batch, params.n_ctx);
print_build_info();
g_collector.set_params(params);
if (params.seed == LLAMA_DEFAULT_SEED) {
params.seed = time(NULL);
}
fprintf(stderr, "%s: seed = %u\n", __func__, params.seed);
std::mt19937 rng(params.seed);
if (params.random_prompt) {
params.prompt = string_random_prompt(rng);
}
sparams.dataset = params.prompt_file;
g_collector.set_parameters(std::move(sparams));
if (!combine_files.empty()) {
std::vector<std::string> files;
size_t pos = 0;
while (true) {
auto new_pos = combine_files.find(',', pos);
if (new_pos != std::string::npos) {
files.emplace_back(combine_files.substr(pos, new_pos - pos));
pos = new_pos + 1;
} else {
files.emplace_back(combine_files.substr(pos));
break;
}
}
if (files.size() < 2) {
fprintf(stderr, "You must provide at least two comma separated files to use --combine\n");
for (const auto & in_file : params.in_files) {
printf("%s : loading imatrix from '%s'\n", __func__, in_file.c_str());
if (!g_collector.load_imatrix(in_file.c_str())) {
fprintf(stderr, "%s : failed to load %s\n", __func__, in_file.c_str());
return 1;
}
printf("Combining the following %d files\n", int(files.size()));
for (auto& file : files) {
printf(" %s\n", file.c_str());
if (!g_collector.load_imatrix(file.c_str(), true)) {
fprintf(stderr, "Failed to load %s\n", file.c_str());
return 1;
}
}
}
if (params.in_files.size() > 1) {
printf("%s : saving combined imatrix to '%s'\n", __func__, params.out_file.c_str());
g_collector.save_imatrix();
return 0;
}
if (!prev_result_file.empty()) {
if (!g_collector.load_imatrix(prev_result_file.c_str(), false)) {
fprintf(stderr, "=============== Failed to load %s\n", prev_result_file.c_str());
return 1;
}
}
llama_backend_init();
@@ -652,6 +613,7 @@ int main(int argc, char ** argv) {
// init
llama_model * model;
llama_context * ctx;
std::tie(model, ctx) = llama_init_from_gpt_params(params);
if (model == nullptr || ctx == nullptr) {
fprintf(stderr, "%s : failed to init\n", __func__);
@@ -670,8 +632,7 @@ int main(int argc, char ** argv) {
fprintf(stderr, "%s\n", gpt_params_get_system_info(params).c_str());
}
bool OK = compute_imatrix(ctx, params, compute_ppl, from_chunk);
if (!OK) {
if (!compute_imatrix(ctx, params)) {
return 1;
}

View File

@@ -107,6 +107,7 @@ int main(int argc, char ** argv) {
g_params = &params;
if (!gpt_params_parse(argc, argv, params)) {
gpt_params_print_usage(argc, argv, params);
return 1;
}
@@ -139,27 +140,6 @@ int main(int argc, char ** argv) {
LOG_TEE("%s: warning: minimum context size is 8, using minimum size.\n", __func__);
params.n_ctx = 8;
}
if (params.instruct) {
printf("\n************\n");
printf("%s: please use the 'main' tool for instruct mode\n", __func__);
printf("************\n\n");
return 0;
}
if (params.chatml) {
printf("\n************\n");
printf("%s: please use the 'main' tool for chatml mode\n", __func__);
printf("************\n\n");
return 0;
}
if (!params.antiprompt.empty()) {
printf("\n************\n");
printf("%s: please use the 'main' tool for antiprompt mode\n", __func__);
printf("************\n\n");
return 0;
}
if (!params.interactive_first && (params.input_prefix.empty() && params.input_suffix.empty())) {
printf("\n************\n");
printf("%s: please use '--interactive_first' or specify '--in_prefix' and/or '--in_suffix'\n", __func__);
@@ -167,20 +147,6 @@ int main(int argc, char ** argv) {
return 0;
}
if (params.random_prompt) {
printf("\n************\n");
printf("%s: please use the 'main' tool for random prompt mode\n", __func__);
printf("************\n\n");
return 0;
}
if (!params.path_prompt_cache.empty()) {
printf("\n************\n");
printf("%s: infill does not support prompt caching\n", __func__);
printf("************\n\n");
return 0;
}
if (params.rope_freq_base != 0.0) {
LOG_TEE("%s: warning: changing RoPE frequency base to %g.\n", __func__, params.rope_freq_base);
@@ -207,17 +173,13 @@ int main(int argc, char ** argv) {
llama_model * model;
llama_context * ctx;
llama_context * ctx_guidance = NULL;
g_model = &model;
g_ctx = &ctx;
// load the model and apply lora adapter, if any
LOG("%s: load the model and apply lora adapter, if any\n", __func__);
std::tie(model, ctx) = llama_init_from_gpt_params(params);
if (sparams.cfg_scale > 1.f) {
struct llama_context_params lparams = llama_context_params_from_gpt_params(params);
ctx_guidance = llama_new_context_with_model(model, lparams);
}
if (model == NULL) {
LOG_TEE("%s: error: unable to load model\n", __func__);
@@ -273,25 +235,6 @@ int main(int argc, char ** argv) {
LOG("embd_inp was considered empty and bos was added: %s\n", LOG_TOKENS_TOSTR_PRETTY(ctx, embd_inp).c_str());
}
// Tokenize negative prompt
std::vector<llama_token> guidance_inp;
int guidance_offset = 0;
int original_prompt_len = 0;
if (ctx_guidance) {
LOG("cfg_negative_prompt: \"%s\"\n", log_tostr(sparams.cfg_negative_prompt));
guidance_inp = ::llama_tokenize(ctx_guidance, sparams.cfg_negative_prompt, true);
LOG("guidance_inp tokenized: %s\n", LOG_TOKENS_TOSTR_PRETTY(ctx_guidance, guidance_inp).c_str());
std::vector<llama_token> original_inp = ::llama_tokenize(ctx, params.prompt, true);
LOG("original_inp tokenized: %s\n", LOG_TOKENS_TOSTR_PRETTY(ctx, original_inp).c_str());
original_prompt_len = original_inp.size();
guidance_offset = (int)guidance_inp.size() - original_prompt_len;
LOG("original_prompt_len: %s", log_tostr(original_prompt_len));
LOG("guidance_offset: %s", log_tostr(guidance_offset));
}
if ((int) embd_inp.size() > n_ctx - 4) {
LOG_TEE("%s: error: prompt is too long (%d tokens, max %d)\n", __func__, (int) embd_inp.size(), n_ctx - 4);
return 1;
@@ -319,15 +262,6 @@ int main(int argc, char ** argv) {
LOG_TEE("%6d -> '%s'\n", embd_inp[i], llama_token_to_piece(ctx, embd_inp[i]).c_str());
}
if (ctx_guidance) {
LOG_TEE("\n");
LOG_TEE("%s: negative prompt: '%s'\n", __func__, sparams.cfg_negative_prompt.c_str());
LOG_TEE("%s: number of tokens in negative prompt = %zu\n", __func__, guidance_inp.size());
for (int i = 0; i < (int) guidance_inp.size(); i++) {
LOG_TEE("%6d -> '%s'\n", guidance_inp[i], llama_token_to_piece(ctx, guidance_inp[i]).c_str());
}
}
if (params.n_keep > 0) {
LOG_TEE("%s: static prompt based on n_keep: '", __func__);
for (int i = 0; i < params.n_keep; i++) {
@@ -395,12 +329,11 @@ int main(int argc, char ** argv) {
is_interacting = params.interactive_first;
}
bool input_echo = true;
bool input_echo = true;
int n_past = 0;
int n_remain = params.n_predict;
int n_consumed = 0;
int n_past_guidance = 0;
int n_past = 0;
int n_remain = params.n_predict;
int n_consumed = 0;
std::vector<int> input_tokens; g_input_tokens = &input_tokens;
std::vector<int> output_tokens; g_output_tokens = &output_tokens;
@@ -410,7 +343,6 @@ int main(int argc, char ** argv) {
console::set_display(console::prompt);
std::vector<llama_token> embd;
std::vector<llama_token> embd_guidance;
struct llama_sampling_context * ctx_sampling = llama_sampling_init(sparams);
@@ -436,7 +368,7 @@ int main(int argc, char ** argv) {
// if we run out of context:
// - take the n_keep first tokens from the original prompt (via n_past)
// - take half of the last (n_ctx - n_keep) tokens and recompute the logits in batches
if (n_past + (int) embd.size() + std::max<int>(0, guidance_offset) > n_ctx) {
if (n_past + (int) embd.size() > n_ctx) {
if (params.n_predict == -2) {
LOG_TEE("\n\n%s: context full and n_predict == -%d => stopping\n", __func__, params.n_predict);
break;
@@ -453,11 +385,7 @@ int main(int argc, char ** argv) {
n_past -= n_discard;
if (ctx_guidance) {
n_past_guidance -= n_discard;
}
LOG("after swap: n_past = %d, n_past_guidance = %d\n", n_past, n_past_guidance);
LOG("after swap: n_past = %d\n", n_past);
LOG("embd: %s\n", LOG_TOKENS_TOSTR_PRETTY(ctx, embd).c_str());
@@ -465,45 +393,6 @@ int main(int argc, char ** argv) {
// evaluate tokens in batches
// embd is typically prepared beforehand to fit within a batch, but not always
if (ctx_guidance) {
int input_size = 0;
llama_token * input_buf = NULL;
if (n_past_guidance < (int) guidance_inp.size()) {
// Guidance context should have the same data with these modifications:
//
// * Replace the initial prompt
// * Shift everything by guidance_offset
embd_guidance = guidance_inp;
if (embd.begin() + original_prompt_len < embd.end()) {
embd_guidance.insert(
embd_guidance.end(),
embd.begin() + original_prompt_len,
embd.end()
);
}
input_buf = embd_guidance.data();
input_size = embd_guidance.size();
LOG("guidance context: %s\n", LOG_TOKENS_TOSTR_PRETTY(ctx, embd_guidance).c_str());
} else {
input_buf = embd.data();
input_size = embd.size();
}
for (int i = 0; i < input_size; i += params.n_batch) {
int n_eval = std::min(input_size - i, params.n_batch);
if (llama_decode(ctx_guidance, llama_batch_get_one(input_buf + i, n_eval, n_past_guidance, 0))) {
LOG_TEE("%s : failed to eval\n", __func__);
return 1;
}
n_past_guidance += n_eval;
}
}
for (int i = 0; i < (int) embd.size(); i += params.n_batch) {
int n_eval = (int) embd.size() - i;
if (n_eval > params.n_batch) {
@@ -525,11 +414,9 @@ int main(int argc, char ** argv) {
}
embd.clear();
embd_guidance.clear();
if ((int) embd_inp.size() <= n_consumed && !is_interacting) {
const llama_token id = llama_sampling_sample(ctx_sampling, ctx, ctx_guidance);
const llama_token id = llama_sampling_sample(ctx_sampling, ctx, nullptr);
llama_sampling_accept(ctx_sampling, ctx, id, true);
@@ -583,7 +470,6 @@ int main(int argc, char ** argv) {
// if not currently processing queued inputs;
if ((int) embd_inp.size() <= n_consumed) {
// deal with eot token in infill mode
if ((llama_sampling_last(ctx_sampling) == llama_token_eot(model) || is_interacting) && params.interactive){
if (is_interacting && !params.interactive_first) {
@@ -644,7 +530,6 @@ int main(int argc, char ** argv) {
embd_inp.insert(embd_inp.end(), inp_sfx.begin(), inp_sfx.end());
embd_inp.push_back(llama_token_middle(model));
embd.clear();
embd_guidance.clear();
n_remain = params.n_predict;
n_past = 0;
n_consumed = 0;
@@ -751,7 +636,6 @@ int main(int argc, char ** argv) {
llama_print_timings(ctx);
write_logfile(ctx, params, model, input_tokens, output_ss.str(), output_tokens);
if (ctx_guidance) { llama_free(ctx_guidance); }
llama_free(ctx);
llama_free_model(model);

View File

@@ -6,52 +6,22 @@ import re
import sys
from typing import Any, Dict, List, Set, Tuple, Union
def _build_repetition(item_rule, min_items, max_items, separator_rule=None, item_rule_is_literal=False):
def _build_repetition(item_rule, min_items, max_items, separator_rule=None):
if min_items == 0 and max_items == 1:
return f'{item_rule}?'
if not separator_rule:
if min_items == 0 and max_items == 1:
return f'{item_rule}?'
elif min_items == 1 and max_items is None:
if min_items == 1 and max_items is None:
return f'{item_rule}+'
result = ''
if min_items > 0:
if item_rule_is_literal and separator_rule is None:
result = '"' + (item_rule[1:-1] * min_items) + '"'
elif min_items == 0 and max_items is None:
return f'{item_rule}*'
else:
result = (f' {separator_rule} ' if separator_rule else ' ').join([item_rule] * min_items)
return f'{item_rule}{{{min_items},{max_items if max_items is not None else ""}}}'
def opt_repetitions(up_to_n, prefix_with_sep=False):
'''
- n=4, no sep: '(a (a (a (a)?)?)?)?'
- n=4, sep=',', prefix: '("," a ("," a ("," a ("," a)?)?)?)?'
- n=4, sep=',', no prefix: '(a ("," a ("," a ("," a)?)?)?)?'
'''
content = f'{separator_rule} {item_rule}' if prefix_with_sep and separator_rule else item_rule
if up_to_n == 0:
return ''
elif up_to_n == 1:
return f'({content})?'
elif separator_rule and not prefix_with_sep:
return f'({content} {opt_repetitions(up_to_n - 1, prefix_with_sep=True)})?'
else:
return (f'({content} ' * up_to_n).rstrip() + (')?' * up_to_n)
if min_items > 0 and max_items != min_items:
result += ' '
if max_items is not None:
result += opt_repetitions(max_items - min_items, prefix_with_sep=min_items > 0)
else:
item_operator = f'({separator_rule + " " if separator_rule else ""}{item_rule})'
if min_items == 0 and separator_rule:
result = f'({item_rule} {item_operator}*)?'
else:
result += f'{item_operator}*'
return result
result = item_rule + ' ' + _build_repetition(f'({separator_rule} {item_rule})', min_items - 1 if min_items > 0 else 0, max_items - 1 if max_items is not None else None)
return f'({result})?' if min_items == 0 else result
class BuiltinRule:
@@ -59,31 +29,29 @@ class BuiltinRule:
self.content = content
self.deps = deps or []
_up_to_15_digits = _build_repetition('[0-9]', 0, 15)
# whitespace is constrained to a single space char to prevent model "running away" in
# whitespace. Also maybe improves generation quality?
SPACE_RULE = '" "?'
PRIMITIVE_RULES = {
'boolean' : BuiltinRule('("true" | "false") space', []),
'decimal-part' : BuiltinRule('[0-9] ' + _up_to_15_digits, []),
'integral-part': BuiltinRule('[0-9] | [1-9] ' + _up_to_15_digits, []),
'decimal-part' : BuiltinRule('[0-9]{1,16}', []),
'integral-part': BuiltinRule('[0] | [1-9] [0-9]{0,15}', []),
'number' : BuiltinRule('("-"? integral-part) ("." decimal-part)? ([eE] [-+]? integral-part)? space', ['integral-part', 'decimal-part']),
'integer' : BuiltinRule('("-"? integral-part) space', ['integral-part']),
'value' : BuiltinRule('object | array | string | number | boolean | null', ['object', 'array', 'string', 'number', 'boolean', 'null']),
'object' : BuiltinRule('"{" space ( string ":" space value ("," space string ":" space value)* )? "}" space', ['string', 'value']),
'array' : BuiltinRule('"[" space ( value ("," space value)* )? "]" space', ['value']),
'uuid' : BuiltinRule(r'"\"" ' + ' "-" '.join('[0-9a-fA-F]' * n for n in [8, 4, 4, 4, 12]) + r' "\"" space', []),
'char' : BuiltinRule(r'[^"\\] | "\\" (["\\/bfnrt] | "u" [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F])', []),
'uuid' : BuiltinRule(r'"\"" [0-9a-fA-F]{8} "-" [0-9a-fA-F]{4} "-" [0-9a-fA-F]{4} "-" [0-9a-fA-F]{4} "-" [0-9a-fA-F]{12} "\"" space', []),
'char' : BuiltinRule(r'[^"\\] | "\\" (["\\/bfnrt] | "u" [0-9a-fA-F]{4})', []),
'string' : BuiltinRule(r'"\"" char* "\"" space', ['char']),
'null' : BuiltinRule('"null" space', []),
}
# TODO: support "uri", "email" string formats
STRING_FORMAT_RULES = {
'date' : BuiltinRule('[0-9] [0-9] [0-9] [0-9] "-" ( "0" [1-9] | "1" [0-2] ) "-" ( \"0\" [1-9] | [1-2] [0-9] | "3" [0-1] )', []),
'time' : BuiltinRule('([01] [0-9] | "2" [0-3]) ":" [0-5] [0-9] ":" [0-5] [0-9] ( "." [0-9] [0-9] [0-9] )? ( "Z" | ( "+" | "-" ) ( [01] [0-9] | "2" [0-3] ) ":" [0-5] [0-9] )', []),
'date' : BuiltinRule('[0-9]{4} "-" ( "0" [1-9] | "1" [0-2] ) "-" ( \"0\" [1-9] | [1-2] [0-9] | "3" [0-1] )', []),
'time' : BuiltinRule('([01] [0-9] | "2" [0-3]) ":" [0-5] [0-9] ":" [0-5] [0-9] ( "." [0-9]{3} )? ( "Z" | ( "+" | "-" ) ( [01] [0-9] | "2" [0-3] ) ":" [0-5] [0-9] )', []),
'date-time' : BuiltinRule('date "T" time', ['date', 'time']),
'date-string' : BuiltinRule('"\\"" date "\\"" space', ['date']),
'time-string' : BuiltinRule('"\\"" time "\\"" space', ['time']),
@@ -333,7 +301,7 @@ class SchemaConverter:
sub_rule_ids[sub] = id
sub = id
seq[-1] = (_build_repetition(f'"{sub}"' if sub_is_literal else sub, min_times, max_times, item_rule_is_literal=sub_is_literal), False)
seq[-1] = (_build_repetition(f'"{sub}"' if sub_is_literal else sub, min_times, max_times), False)
else:
literal = ''
while i < length:

View File

@@ -162,7 +162,7 @@ $ ./llama-bench -o csv
```
```csv
build_commit,build_number,cuda,opencl,metal,gpu_blas,blas,cpu_info,gpu_info,model_filename,model_type,model_size,model_n_params,n_batch,n_threads,f16_kv,n_gpu_layers,main_gpu,mul_mat_q,tensor_split,n_prompt,n_gen,test_time,avg_ns,stddev_ns,avg_ts,stddev_ts
build_commit,build_number,cuda,metal,gpu_blas,blas,cpu_info,gpu_info,model_filename,model_type,model_size,model_n_params,n_batch,n_threads,f16_kv,n_gpu_layers,main_gpu,mul_mat_q,tensor_split,n_prompt,n_gen,test_time,avg_ns,stddev_ns,avg_ts,stddev_ts
"3469684","1275","1","0","0","1","1","13th Gen Intel(R) Core(TM) i9-13900K","NVIDIA GeForce RTX 3090 Ti","models/7B/ggml-model-q4_0.gguf","llama 7B mostly Q4_0","3825065984","6738415616","512","16","1","99","0","1","0.00","512","0","2023-09-23T12:09:01Z","212155977","732372","2413.341687","8.305961"
"3469684","1275","1","0","0","1","1","13th Gen Intel(R) Core(TM) i9-13900K","NVIDIA GeForce RTX 3090 Ti","models/7B/ggml-model-q4_0.gguf","llama 7B mostly Q4_0","3825065984","6738415616","512","16","1","99","0","1","0.00","0","128","2023-09-23T12:09:02Z","969320879","2728399","132.052051","0.371342"
```
@@ -179,7 +179,6 @@ $ ./llama-bench -o json
"build_commit": "3469684",
"build_number": 1275,
"cuda": true,
"opencl": false,
"metal": false,
"gpu_blas": true,
"blas": true,
@@ -210,7 +209,6 @@ $ ./llama-bench -o json
"build_commit": "3469684",
"build_number": 1275,
"cuda": true,
"opencl": false,
"metal": false,
"gpu_blas": true,
"blas": true,
@@ -253,7 +251,6 @@ CREATE TABLE IF NOT EXISTS test (
build_commit TEXT,
build_number INTEGER,
cuda INTEGER,
opencl INTEGER,
metal INTEGER,
gpu_blas INTEGER,
blas INTEGER,
@@ -279,6 +276,6 @@ CREATE TABLE IF NOT EXISTS test (
stddev_ts REAL
);
INSERT INTO test (build_commit, build_number, cuda, opencl, metal, gpu_blas, blas, cpu_info, gpu_info, model_filename, model_type, model_size, model_n_params, n_batch, n_threads, f16_kv, n_gpu_layers, main_gpu, mul_mat_q, tensor_split, n_prompt, n_gen, test_time, avg_ns, stddev_ns, avg_ts, stddev_ts) VALUES ('3469684', '1275', '1', '0', '0', '1', '1', '13th Gen Intel(R) Core(TM) i9-13900K', 'NVIDIA GeForce RTX 3090 Ti', 'models/7B/ggml-model-q4_0.gguf', 'llama 7B mostly Q4_0', '3825065984', '6738415616', '512', '16', '1', '99', '0', '1', '0.00', '512', '0', '2023-09-23T12:10:30Z', '212693772', '743623', '2407.240204', '8.409634');
INSERT INTO test (build_commit, build_number, cuda, opencl, metal, gpu_blas, blas, cpu_info, gpu_info, model_filename, model_type, model_size, model_n_params, n_batch, n_threads, f16_kv, n_gpu_layers, main_gpu, mul_mat_q, tensor_split, n_prompt, n_gen, test_time, avg_ns, stddev_ns, avg_ts, stddev_ts) VALUES ('3469684', '1275', '1', '0', '0', '1', '1', '13th Gen Intel(R) Core(TM) i9-13900K', 'NVIDIA GeForce RTX 3090 Ti', 'models/7B/ggml-model-q4_0.gguf', 'llama 7B mostly Q4_0', '3825065984', '6738415616', '512', '16', '1', '99', '0', '1', '0.00', '0', '128', '2023-09-23T12:10:31Z', '977925003', '4037361', '130.891159', '0.537692');
INSERT INTO test (build_commit, build_number, cuda, metal, gpu_blas, blas, cpu_info, gpu_info, model_filename, model_type, model_size, model_n_params, n_batch, n_threads, f16_kv, n_gpu_layers, main_gpu, mul_mat_q, tensor_split, n_prompt, n_gen, test_time, avg_ns, stddev_ns, avg_ts, stddev_ts) VALUES ('3469684', '1275', '1', '0', '0', '1', '1', '13th Gen Intel(R) Core(TM) i9-13900K', 'NVIDIA GeForce RTX 3090 Ti', 'models/7B/ggml-model-q4_0.gguf', 'llama 7B mostly Q4_0', '3825065984', '6738415616', '512', '16', '1', '99', '0', '1', '0.00', '512', '0', '2023-09-23T12:10:30Z', '212693772', '743623', '2407.240204', '8.409634');
INSERT INTO test (build_commit, build_number, cuda, metal, gpu_blas, blas, cpu_info, gpu_info, model_filename, model_type, model_size, model_n_params, n_batch, n_threads, f16_kv, n_gpu_layers, main_gpu, mul_mat_q, tensor_split, n_prompt, n_gen, test_time, avg_ns, stddev_ns, avg_ts, stddev_ts) VALUES ('3469684', '1275', '1', '0', '0', '1', '1', '13th Gen Intel(R) Core(TM) i9-13900K', 'NVIDIA GeForce RTX 3090 Ti', 'models/7B/ggml-model-q4_0.gguf', 'llama 7B mostly Q4_0', '3825065984', '6738415616', '512', '16', '1', '99', '0', '1', '0.00', '0', '128', '2023-09-23T12:10:31Z', '977925003', '4037361', '130.891159', '0.537692');
```

View File

@@ -41,20 +41,6 @@ static std::string join(const std::vector<T> & values, const std::string & delim
return str.str();
}
template<class T>
static std::vector<T> split(const std::string & str, char delim) {
std::vector<T> values;
std::istringstream str_stream(str);
std::string token;
while (std::getline(str_stream, token, delim)) {
T value;
std::istringstream token_stream(token);
token_stream >> value;
values.push_back(value);
}
return values;
}
template<typename T, typename F>
static std::vector<std::string> transform_to_str(const std::vector<T> & values, F f) {
std::vector<std::string> str_values;
@@ -140,10 +126,11 @@ static std::string get_gpu_info() {
}
// command line params
enum output_formats {CSV, JSON, MARKDOWN, SQL};
enum output_formats {NONE, CSV, JSON, MARKDOWN, SQL};
static const char * output_format_str(output_formats format) {
switch (format) {
case NONE: return "none";
case CSV: return "csv";
case JSON: return "json";
case MARKDOWN: return "md";
@@ -152,6 +139,23 @@ static const char * output_format_str(output_formats format) {
}
}
static bool output_format_from_str(const std::string & s, output_formats & format) {
if (s == "none") {
format = NONE;
} else if (s == "csv") {
format = CSV;
} else if (s == "json") {
format = JSON;
} else if (s == "md") {
format = MARKDOWN;
} else if (s == "sql") {
format = SQL;
} else {
return false;
}
return true;
}
static const char * split_mode_str(llama_split_mode mode) {
switch (mode) {
case LLAMA_SPLIT_MODE_NONE: return "none";
@@ -190,31 +194,33 @@ struct cmd_params {
int reps;
bool verbose;
output_formats output_format;
output_formats output_format_stderr;
};
static const cmd_params cmd_params_defaults = {
/* model */ {"models/7B/ggml-model-q4_0.gguf"},
/* n_prompt */ {512},
/* n_gen */ {128},
/* n_pg */ {},
/* n_batch */ {2048},
/* n_ubatch */ {512},
/* type_k */ {GGML_TYPE_F16},
/* type_v */ {GGML_TYPE_F16},
/* n_threads */ {cpu_get_num_math()},
/* n_gpu_layers */ {99},
/* rpc_servers */ {""},
/* split_mode */ {LLAMA_SPLIT_MODE_LAYER},
/* main_gpu */ {0},
/* no_kv_offload */ {false},
/* flash_attn */ {false},
/* tensor_split */ {std::vector<float>(llama_max_devices(), 0.0f)},
/* use_mmap */ {true},
/* embeddings */ {false},
/* numa */ GGML_NUMA_STRATEGY_DISABLED,
/* reps */ 5,
/* verbose */ false,
/* output_format */ MARKDOWN
/* model */ {"models/7B/ggml-model-q4_0.gguf"},
/* n_prompt */ {512},
/* n_gen */ {128},
/* n_pg */ {},
/* n_batch */ {2048},
/* n_ubatch */ {512},
/* type_k */ {GGML_TYPE_F16},
/* type_v */ {GGML_TYPE_F16},
/* n_threads */ {cpu_get_num_math()},
/* n_gpu_layers */ {99},
/* rpc_servers */ {""},
/* split_mode */ {LLAMA_SPLIT_MODE_LAYER},
/* main_gpu */ {0},
/* no_kv_offload */ {false},
/* flash_attn */ {false},
/* tensor_split */ {std::vector<float>(llama_max_devices(), 0.0f)},
/* use_mmap */ {true},
/* embeddings */ {false},
/* numa */ GGML_NUMA_STRATEGY_DISABLED,
/* reps */ 5,
/* verbose */ false,
/* output_format */ MARKDOWN,
/* output_format_stderr */ NONE,
};
static void print_usage(int /* argc */, char ** argv) {
@@ -243,6 +249,7 @@ static void print_usage(int /* argc */, char ** argv) {
printf(" -ts, --tensor-split <ts0/ts1/..> (default: 0)\n");
printf(" -r, --repetitions <n> (default: %d)\n", cmd_params_defaults.reps);
printf(" -o, --output <csv|json|md|sql> (default: %s)\n", output_format_str(cmd_params_defaults.output_format));
printf(" -oe, --output-err <csv|json|md|sql> (default: %s)\n", output_format_str(cmd_params_defaults.output_format_stderr));
printf(" -v, --verbose (default: %s)\n", cmd_params_defaults.verbose ? "1" : "0");
printf("\n");
printf("Multiple values can be given for each parameter by separating them with ',' or by specifying the parameter multiple times.\n");
@@ -284,6 +291,7 @@ static cmd_params parse_cmd_params(int argc, char ** argv) {
params.verbose = cmd_params_defaults.verbose;
params.output_format = cmd_params_defaults.output_format;
params.output_format_stderr = cmd_params_defaults.output_format_stderr;
params.reps = cmd_params_defaults.reps;
for (int i = 1; i < argc; i++) {
@@ -300,28 +308,28 @@ static cmd_params parse_cmd_params(int argc, char ** argv) {
invalid_param = true;
break;
}
auto p = split<std::string>(argv[i], split_delim);
auto p = string_split<std::string>(argv[i], split_delim);
params.model.insert(params.model.end(), p.begin(), p.end());
} else if (arg == "-p" || arg == "--n-prompt") {
if (++i >= argc) {
invalid_param = true;
break;
}
auto p = split<int>(argv[i], split_delim);
auto p = string_split<int>(argv[i], split_delim);
params.n_prompt.insert(params.n_prompt.end(), p.begin(), p.end());
} else if (arg == "-n" || arg == "--n-gen") {
if (++i >= argc) {
invalid_param = true;
break;
}
auto p = split<int>(argv[i], split_delim);
auto p = string_split<int>(argv[i], split_delim);
params.n_gen.insert(params.n_gen.end(), p.begin(), p.end());
} else if (arg == "-pg") {
if (++i >= argc) {
invalid_param = true;
break;
}
auto p = split<std::string>(argv[i], ',');
auto p = string_split<std::string>(argv[i], ',');
if (p.size() != 2) {
invalid_param = true;
break;
@@ -332,21 +340,21 @@ static cmd_params parse_cmd_params(int argc, char ** argv) {
invalid_param = true;
break;
}
auto p = split<int>(argv[i], split_delim);
auto p = string_split<int>(argv[i], split_delim);
params.n_batch.insert(params.n_batch.end(), p.begin(), p.end());
} else if (arg == "-ub" || arg == "--ubatch-size") {
if (++i >= argc) {
invalid_param = true;
break;
}
auto p = split<int>(argv[i], split_delim);
auto p = string_split<int>(argv[i], split_delim);
params.n_ubatch.insert(params.n_ubatch.end(), p.begin(), p.end());
} else if (arg == "-ctk" || arg == "--cache-type-k") {
if (++i >= argc) {
invalid_param = true;
break;
}
auto p = split<std::string>(argv[i], split_delim);
auto p = string_split<std::string>(argv[i], split_delim);
std::vector<ggml_type> types;
for (const auto & t : p) {
ggml_type gt = ggml_type_from_name(t);
@@ -362,7 +370,7 @@ static cmd_params parse_cmd_params(int argc, char ** argv) {
invalid_param = true;
break;
}
auto p = split<std::string>(argv[i], split_delim);
auto p = string_split<std::string>(argv[i], split_delim);
std::vector<ggml_type> types;
for (const auto & t : p) {
ggml_type gt = ggml_type_from_name(t);
@@ -378,14 +386,14 @@ static cmd_params parse_cmd_params(int argc, char ** argv) {
invalid_param = true;
break;
}
auto p = split<int>(argv[i], split_delim);
auto p = string_split<int>(argv[i], split_delim);
params.n_threads.insert(params.n_threads.end(), p.begin(), p.end());
} else if (arg == "-ngl" || arg == "--n-gpu-layers") {
if (++i >= argc) {
invalid_param = true;
break;
}
auto p = split<int>(argv[i], split_delim);
auto p = string_split<int>(argv[i], split_delim);
params.n_gpu_layers.insert(params.n_gpu_layers.end(), p.begin(), p.end());
} else if (arg == "-rpc" || arg == "--rpc") {
if (++i >= argc) {
@@ -398,7 +406,7 @@ static cmd_params parse_cmd_params(int argc, char ** argv) {
invalid_param = true;
break;
}
auto p = split<std::string>(argv[i], split_delim);
auto p = string_split<std::string>(argv[i], split_delim);
std::vector<llama_split_mode> modes;
for (const auto & m : p) {
llama_split_mode mode;
@@ -420,13 +428,13 @@ static cmd_params parse_cmd_params(int argc, char ** argv) {
invalid_param = true;
break;
}
params.main_gpu = split<int>(argv[i], split_delim);
params.main_gpu = string_split<int>(argv[i], split_delim);
} else if (arg == "-nkvo" || arg == "--no-kv-offload") {
if (++i >= argc) {
invalid_param = true;
break;
}
auto p = split<bool>(argv[i], split_delim);
auto p = string_split<bool>(argv[i], split_delim);
params.no_kv_offload.insert(params.no_kv_offload.end(), p.begin(), p.end());
} else if (arg == "--numa") {
if (++i >= argc) {
@@ -444,28 +452,28 @@ static cmd_params parse_cmd_params(int argc, char ** argv) {
invalid_param = true;
break;
}
auto p = split<bool>(argv[i], split_delim);
auto p = string_split<bool>(argv[i], split_delim);
params.flash_attn.insert(params.flash_attn.end(), p.begin(), p.end());
} else if (arg == "-mmp" || arg == "--mmap") {
if (++i >= argc) {
invalid_param = true;
break;
}
auto p = split<bool>(argv[i], split_delim);
auto p = string_split<bool>(argv[i], split_delim);
params.use_mmap.insert(params.use_mmap.end(), p.begin(), p.end());
} else if (arg == "-embd" || arg == "--embeddings") {
if (++i >= argc) {
invalid_param = true;
break;
}
auto p = split<bool>(argv[i], split_delim);
auto p = string_split<bool>(argv[i], split_delim);
params.embeddings.insert(params.embeddings.end(), p.begin(), p.end());
} else if (arg == "-ts" || arg == "--tensor-split") {
if (++i >= argc) {
invalid_param = true;
break;
}
for (auto ts : split<std::string>(argv[i], split_delim)) {
for (auto ts : string_split<std::string>(argv[i], split_delim)) {
// split string by ; and /
const std::regex regex{R"([;/]+)"};
std::sregex_token_iterator it{ts.begin(), ts.end(), regex, -1};
@@ -493,18 +501,13 @@ static cmd_params parse_cmd_params(int argc, char ** argv) {
invalid_param = true;
break;
}
if (argv[i] == std::string("csv")) {
params.output_format = CSV;
} else if (argv[i] == std::string("json")) {
params.output_format = JSON;
} else if (argv[i] == std::string("md")) {
params.output_format = MARKDOWN;
} else if (argv[i] == std::string("sql")) {
params.output_format = SQL;
} else {
invalid_param = !output_format_from_str(argv[i], params.output_format);
} else if (arg == "-oe" || arg == "--output-err") {
if (++i >= argc) {
invalid_param = true;
break;
}
invalid_param = !output_format_from_str(argv[i], params.output_format_stderr);
} else if (arg == "-v" || arg == "--verbose") {
params.verbose = true;
} else {
@@ -706,7 +709,6 @@ struct test {
static const std::string build_commit;
static const int build_number;
static const bool cuda;
static const bool opencl;
static const bool vulkan;
static const bool kompute;
static const bool metal;
@@ -795,9 +797,6 @@ struct test {
if (cuda) {
return GGML_CUDA_NAME;
}
if (opencl) {
return "OpenCL";
}
if (vulkan) {
return "Vulkan";
}
@@ -826,7 +825,7 @@ struct test {
static const std::vector<std::string> & get_fields() {
static const std::vector<std::string> fields = {
"build_commit", "build_number",
"cuda", "opencl", "vulkan", "kompute", "metal", "sycl", "rpc", "gpu_blas", "blas",
"cuda", "vulkan", "kompute", "metal", "sycl", "rpc", "gpu_blas", "blas",
"cpu_info", "gpu_info",
"model_filename", "model_type", "model_size", "model_n_params",
"n_batch", "n_ubatch",
@@ -852,7 +851,7 @@ struct test {
field == "avg_ns" || field == "stddev_ns") {
return INT;
}
if (field == "cuda" || field == "opencl" || field == "vulkan" || field == "kompute" || field == "metal" ||
if (field == "cuda" || field == "vulkan" || field == "kompute" || field == "metal" ||
field == "gpu_blas" || field == "blas" || field == "sycl" ||field == "f16_kv" || field == "no_kv_offload" ||
field == "flash_attn" || field == "use_mmap" || field == "embeddings") {
return BOOL;
@@ -881,7 +880,7 @@ struct test {
}
std::vector<std::string> values = {
build_commit, std::to_string(build_number),
std::to_string(cuda), std::to_string(opencl), std::to_string(vulkan), std::to_string(vulkan),
std::to_string(cuda), std::to_string(vulkan), std::to_string(vulkan),
std::to_string(metal), std::to_string(sycl), std::to_string(rpc), std::to_string(gpu_blas), std::to_string(blas),
cpu_info, gpu_info,
model_filename, model_type, std::to_string(model_size), std::to_string(model_n_params),
@@ -910,7 +909,6 @@ struct test {
const std::string test::build_commit = LLAMA_COMMIT;
const int test::build_number = LLAMA_BUILD_NUMBER;
const bool test::cuda = !!ggml_cpu_has_cuda();
const bool test::opencl = !!ggml_cpu_has_clblast();
const bool test::vulkan = !!ggml_cpu_has_vulkan();
const bool test::kompute = !!ggml_cpu_has_kompute();
const bool test::metal = !!ggml_cpu_has_metal();
@@ -1278,6 +1276,22 @@ static void llama_null_log_callback(enum ggml_log_level level, const char * text
(void) user_data;
}
static std::unique_ptr<printer> create_printer(output_formats format) {
switch (format) {
case NONE:
return nullptr;
case CSV:
return std::unique_ptr<printer>(new csv_printer());
case JSON:
return std::unique_ptr<printer>(new json_printer());
case MARKDOWN:
return std::unique_ptr<printer>(new markdown_printer());
case SQL:
return std::unique_ptr<printer>(new sql_printer());
}
GGML_ASSERT(false);
}
int main(int argc, char ** argv) {
// try to set locale for unicode characters in markdown
setlocale(LC_CTYPE, ".UTF-8");
@@ -1304,26 +1318,18 @@ int main(int argc, char ** argv) {
llama_numa_init(params.numa);
// initialize printer
std::unique_ptr<printer> p;
switch (params.output_format) {
case CSV:
p.reset(new csv_printer());
break;
case JSON:
p.reset(new json_printer());
break;
case MARKDOWN:
p.reset(new markdown_printer());
break;
case SQL:
p.reset(new sql_printer());
break;
default:
assert(false);
exit(1);
std::unique_ptr<printer> p = create_printer(params.output_format);
std::unique_ptr<printer> p_err = create_printer(params.output_format_stderr);
if (p) {
p->fout = stdout;
p->print_header(params);
}
if (p_err) {
p_err->fout = stderr;
p_err->print_header(params);
}
p->fout = stdout;
p->print_header(params);
std::vector<cmd_params_instance> params_instances = get_cmd_params_instances(params);
@@ -1381,7 +1387,15 @@ int main(int argc, char ** argv) {
t.samples_ns.push_back(t_ns);
}
p->print_test(t);
if (p) {
p->print_test(t);
fflush(p->fout);
}
if (p_err) {
p_err->print_test(t);
fflush(p_err->fout);
}
llama_print_timings(ctx);
@@ -1390,7 +1404,13 @@ int main(int argc, char ** argv) {
llama_free_model(lmodel);
p->print_footer();
if (p) {
p->print_footer();
}
if (p_err) {
p_err->print_footer();
}
llama_backend_free();

View File

@@ -1,18 +0,0 @@
#!/bin/bash
#
# Temporary script - will be removed in the future
#
cd `dirname $0`
cd ..
./main -m models/available/Llama2/13B/llama-2-13b.ggmlv3.q4_0.bin \
--color \
--ctx_size 2048 \
-n -1 \
-ins -b 256 \
--top_k 10000 \
--temp 0.2 \
--repeat_penalty 1.1 \
-t 8

View File

@@ -1,18 +0,0 @@
#!/bin/bash
#
# Temporary script - will be removed in the future
#
cd `dirname $0`
cd ..
./main -m models/available/Llama2/7B/llama-2-7b.ggmlv3.q4_0.bin \
--color \
--ctx_size 2048 \
-n -1 \
-ins -b 256 \
--top_k 10000 \
--temp 0.2 \
--repeat_penalty 1.1 \
-t 8

View File

@@ -112,9 +112,12 @@ struct llava_context {
struct llama_model * model = NULL;
};
static void show_additional_info(int /*argc*/, char ** argv) {
LOG_TEE("\n example usage: %s -m <llava-v1.5-7b/ggml-model-q5_k.gguf> --mmproj <llava-v1.5-7b/mmproj-model-f16.gguf> --image <path/to/an/image.jpg> --image <path/to/another/image.jpg> [--temp 0.1] [-p \"describe the image in detail.\"]\n", argv[0]);
LOG_TEE(" note: a lower temperature value like 0.1 is recommended for better quality.\n");
static void print_usage(int argc, char ** argv, const gpt_params & params) {
gpt_params_print_usage(argc, argv, params);
LOG_TEE("\n example usage:\n");
LOG_TEE("\n %s -m <llava-v1.5-7b/ggml-model-q5_k.gguf> --mmproj <llava-v1.5-7b/mmproj-model-f16.gguf> --image <path/to/an/image.jpg> --image <path/to/another/image.jpg> [--temp 0.1] [-p \"describe the image in detail.\"]\n", argv[0]);
LOG_TEE("\n note: a lower temperature value like 0.1 is recommended for better quality.\n");
}
static struct llava_image_embed * load_image(llava_context * ctx_llava, gpt_params * params, const std::string & fname) {
@@ -278,7 +281,7 @@ int main(int argc, char ** argv) {
gpt_params params;
if (!gpt_params_parse(argc, argv, params)) {
show_additional_info(argc, argv);
print_usage(argc, argv, params);
return 1;
}
@@ -290,8 +293,7 @@ int main(int argc, char ** argv) {
#endif // LOG_DISABLE_LOGS
if (params.mmproj.empty() || (params.image.empty() && !prompt_contains_image(params.prompt))) {
gpt_params_print_usage(argc, argv, params);
show_additional_info(argc, argv);
print_usage(argc, argv, {});
return 1;
}
auto model = llava_init(&params);

View File

@@ -37,7 +37,8 @@ struct ngram_container {
int main(int argc, char ** argv) {
gpt_params params;
if (gpt_params_parse(argc, argv, params) == false) {
if (!gpt_params_parse(argc, argv, params)) {
gpt_params_print_usage(argc, argv, params);
return 1;
}

View File

@@ -14,8 +14,10 @@ int main(int argc, char ** argv){
gpt_params params;
if (!gpt_params_parse(argc, argv, params)) {
gpt_params_print_usage(argc, argv, params);
return 1;
}
// init llama.cpp
llama_backend_init();
llama_numa_init(params.numa);

View File

@@ -16,6 +16,7 @@ int main(int argc, char ** argv){
gpt_params params;
if (!gpt_params_parse(argc, argv, params)) {
gpt_params_print_usage(argc, argv, params);
return 1;
}

View File

@@ -15,6 +15,7 @@ int main(int argc, char ** argv){
gpt_params params;
if (!gpt_params_parse(argc, argv, params)) {
gpt_params_print_usage(argc, argv, params);
return 1;
}

View File

@@ -8,16 +8,14 @@ Because this example is "outside of the source tree", it is important to first b
### Considerations
When hardware acceleration libraries are used (e.g. CUDA, Metal, CLBlast, etc.), CMake must be able to locate the associated CMake package. In the example below, when building _main-cmake-pkg_ notice the `CMAKE_PREFIX_PATH` includes the Llama CMake package location _in addition to_ the CLBlast package—which was used when compiling _llama.cpp_.
When hardware acceleration libraries are used (e.g. CUDA, Metal, etc.), CMake must be able to locate the associated CMake package.
### Build llama.cpp and install to C:\LlamaCPP directory
In this case, CLBlast was already installed so the CMake package is referenced in `CMAKE_PREFIX_PATH`.
```cmd
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DBUILD_SHARED_LIBS=OFF -DLLAMA_CLBLAST=ON -DCMAKE_PREFIX_PATH=C:/CLBlast/lib/cmake/CLBlast -G "Visual Studio 17 2022" -A x64
cmake -B build -DBUILD_SHARED_LIBS=OFF -G "Visual Studio 17 2022" -A x64
cmake --build build --config Release
cmake --install build --prefix C:/LlamaCPP
```
@@ -27,7 +25,7 @@ cmake --install build --prefix C:/LlamaCPP
```cmd
cd ..\examples\main-cmake-pkg
cmake -B build -DBUILD_SHARED_LIBS=OFF -DCMAKE_PREFIX_PATH="C:/CLBlast/lib/cmake/CLBlast;C:/LlamaCPP/lib/cmake/Llama" -G "Visual Studio 17 2022" -A x64
cmake -B build -DBUILD_SHARED_LIBS=OFF -DCMAKE_PREFIX_PATH="C:/LlamaCPP/lib/cmake/Llama" -G "Visual Studio 17 2022" -A x64
cmake --build build --config Release
cmake --install build --prefix C:/MyLlamaApp
```

View File

@@ -53,13 +53,13 @@ The following command generates "infinite" text from a starting prompt (you can
#### Unix-based systems (Linux, macOS, etc.):
```bash
./main -m models/7B/ggml-model.bin --ignore-eos -n -1 --random-prompt
./main -m models/7B/ggml-model.bin --ignore-eos -n -1
```
#### Windows:
```powershell
main.exe -m models\7B\ggml-model.bin --ignore-eos -n -1 --random-prompt
main.exe -m models\7B\ggml-model.bin --ignore-eos -n -1
```
## Common Options
@@ -69,7 +69,6 @@ In this section, we cover the most commonly used options for running the `main`
- `-m FNAME, --model FNAME`: Specify the path to the LLaMA model file (e.g., `models/7B/ggml-model.gguf`; inferred from `--model-url` if set).
- `-mu MODEL_URL --model-url MODEL_URL`: Specify a remote http url to download the file (e.g https://huggingface.co/ggml-org/models/resolve/main/phi-2/ggml-model-q4_0.gguf).
- `-i, --interactive`: Run the program in interactive mode, allowing you to provide input directly and receive real-time responses.
- `-ins, --instruct`: Run the program in instruction mode, which is particularly useful when working with Alpaca models.
- `-n N, --n-predict N`: Set the number of tokens to predict when generating text. Adjusting this value can influence the length of the generated text.
- `-c N, --ctx-size N`: Set the size of the prompt context. The default is 512, but LLaMA models were built with a context of 2048, which will provide better results for longer input/inference.
@@ -80,11 +79,10 @@ The `main` program provides several ways to interact with the LLaMA models using
- `--prompt PROMPT`: Provide a prompt directly as a command-line option.
- `--file FNAME`: Provide a file containing a prompt or multiple prompts.
- `--interactive-first`: Run the program in interactive mode and wait for input right away. (More on this below.)
- `--random-prompt`: Start with a randomized prompt.
## Interaction
The `main` program offers a seamless way to interact with LLaMA models, allowing users to engage in real-time conversations or provide instructions for specific tasks. The interactive mode can be triggered using various options, including `--interactive`, `--interactive-first`, and `--instruct`.
The `main` program offers a seamless way to interact with LLaMA models, allowing users to engage in real-time conversations or provide instructions for specific tasks. The interactive mode can be triggered using various options, including `--interactive` and `--interactive-first`.
In interactive mode, users can participate in text generation by injecting their input during the process. Users can press `Ctrl+C` at any time to interject and type their input, followed by pressing `Return` to submit it to the LLaMA model. To submit additional lines without finalizing input, users can end the current line with a backslash (`\`) and continue typing.
@@ -92,7 +90,6 @@ In interactive mode, users can participate in text generation by injecting their
- `-i, --interactive`: Run the program in interactive mode, allowing users to engage in real-time conversations or provide specific instructions to the model.
- `--interactive-first`: Run the program in interactive mode and immediately wait for user input before starting the text generation.
- `-ins, --instruct`: Run the program in instruction mode, which is specifically designed to work with Alpaca models that excel in completing tasks based on user instructions.
- `--color`: Enable colorized output to differentiate visually distinguishing between prompts, user input, and generated text.
By understanding and utilizing these interaction options, you can create engaging and dynamic experiences with the LLaMA models, tailoring the text generation process to your specific needs.
@@ -121,16 +118,6 @@ The `--in-suffix` flag is used to add a suffix after your input. This is useful
./main -r "User:" --in-prefix " " --in-suffix "Assistant:"
```
### Instruction Mode
Instruction mode is particularly useful when working with Alpaca models, which are designed to follow user instructions for specific tasks:
- `-ins, --instruct`: Enable instruction mode to leverage the capabilities of Alpaca models in completing tasks based on user-provided instructions.
Technical detail: the user's input is internally prefixed with the reverse prompt (or `### Instruction:` as the default), and followed by `### Response:` (except if you just press Return without any input, to keep generating a longer response).
By understanding and utilizing these interaction options, you can create engaging and dynamic experiences with the LLaMA models, tailoring the text generation process to your specific needs.
## Context Management
During text generation, LLaMA models have a limited context size, which means they can only consider a certain number of tokens from the input and generated text. When the context fills up, the model resets internally, potentially losing some information from the beginning of the conversation or instructions. Context management options help maintain continuity and coherence in these situations.

View File

@@ -122,8 +122,10 @@ int main(int argc, char ** argv) {
g_params = &params;
if (!gpt_params_parse(argc, argv, params)) {
gpt_params_print_usage(argc, argv, params);
return 1;
}
llama_sampling_params & sparams = params.sparams;
#ifndef LOG_DISABLE_LOGS
@@ -180,9 +182,6 @@ int main(int argc, char ** argv) {
LOG_TEE("%s: seed = %u\n", __func__, params.seed);
std::mt19937 rng(params.seed);
if (params.random_prompt) {
params.prompt = string_random_prompt(rng);
}
LOG("%s: llama backend init\n", __func__);
llama_backend_init();
@@ -250,11 +249,8 @@ int main(int argc, char ** argv) {
std::vector<llama_token> embd_inp;
if (params.interactive_first || params.instruct || params.chatml || !params.prompt.empty() || session_tokens.empty()) {
if (params.interactive_first || !params.prompt.empty() || session_tokens.empty()) {
LOG("tokenize the prompt\n");
if (params.chatml) {
params.prompt = "<|im_start|>system\n" + params.prompt + "<|im_end|>";
}
embd_inp = ::llama_tokenize(ctx, params.prompt, true, true);
} else {
LOG("use session tokens\n");
@@ -332,37 +328,13 @@ int main(int argc, char ** argv) {
}
// number of tokens to keep when resetting context
if (params.n_keep < 0 || params.n_keep > (int) embd_inp.size() || params.instruct || params.chatml) {
if (params.n_keep < 0 || params.n_keep > (int) embd_inp.size()) {
params.n_keep = (int)embd_inp.size();
} else {
params.n_keep += add_bos; // always keep the BOS token
}
// prefix & suffix for instruct mode
const auto inp_pfx = ::llama_tokenize(ctx, "\n\n### Instruction:\n\n", true, true);
const auto inp_sfx = ::llama_tokenize(ctx, "\n\n### Response:\n\n", false, true);
LOG("inp_pfx: %s\n", LOG_TOKENS_TOSTR_PRETTY(ctx, inp_pfx).c_str());
LOG("inp_sfx: %s\n", LOG_TOKENS_TOSTR_PRETTY(ctx, inp_sfx).c_str());
// chatml prefix & suffix
const auto cml_pfx = ::llama_tokenize(ctx, "\n<|im_start|>user\n", true, true);
const auto cml_sfx = ::llama_tokenize(ctx, "<|im_end|>\n<|im_start|>assistant\n", false, true);
LOG("cml_pfx: %s\n", LOG_TOKENS_TOSTR_PRETTY(ctx, cml_pfx).c_str());
LOG("cml_sfx: %s\n", LOG_TOKENS_TOSTR_PRETTY(ctx, cml_sfx).c_str());
// in instruct mode, we inject a prefix and a suffix to each input by the user
if (params.instruct) {
params.interactive_first = true;
params.antiprompt.emplace_back("### Instruction:\n\n");
}
// similar for chatml mode
else if (params.chatml) {
params.interactive_first = true;
params.antiprompt.emplace_back("<|im_start|>user\n");
}
else if (params.conversation) {
if (params.conversation) {
params.interactive_first = true;
}
@@ -823,15 +795,13 @@ int main(int argc, char ** argv) {
is_interacting = true;
printf("\n");
} else if (params.instruct || params.chatml) {
is_interacting = true;
}
}
if (n_past > 0 && is_interacting) {
LOG("waiting for user input\n");
if (params.conversation || params.instruct || params.chatml) {
if (params.conversation) {
printf("\n> ");
}
@@ -874,24 +844,12 @@ int main(int argc, char ** argv) {
const size_t original_size = embd_inp.size();
// instruct mode: insert instruction prefix
if (params.instruct && !is_antiprompt) {
LOG("inserting instruction prefix\n");
n_consumed = embd_inp.size();
embd_inp.insert(embd_inp.end(), inp_pfx.begin(), inp_pfx.end());
}
// chatml mode: insert user chat prefix
if (params.chatml && !is_antiprompt) {
LOG("inserting chatml prefix\n");
n_consumed = embd_inp.size();
embd_inp.insert(embd_inp.end(), cml_pfx.begin(), cml_pfx.end());
}
if (params.escape) {
string_process_escapes(buffer);
}
const auto line_pfx = ::llama_tokenize(ctx, params.input_prefix, false, true);
const auto line_inp = ::llama_tokenize(ctx, buffer, false, params.interactive_specials);
const auto line_inp = ::llama_tokenize(ctx, buffer, false, false);
const auto line_sfx = ::llama_tokenize(ctx, params.input_suffix, false, true);
LOG("input tokens: %s\n", LOG_TOKENS_TOSTR_PRETTY(ctx, line_inp).c_str());
@@ -900,17 +858,6 @@ int main(int argc, char ** argv) {
embd_inp.insert(embd_inp.end(), line_inp.begin(), line_inp.end());
embd_inp.insert(embd_inp.end(), line_sfx.begin(), line_sfx.end());
// instruct mode: insert response suffix
if (params.instruct) {
LOG("inserting instruction suffix\n");
embd_inp.insert(embd_inp.end(), inp_sfx.begin(), inp_sfx.end());
}
// chatml mode: insert assistant chat suffix
if (params.chatml) {
LOG("inserting chatml suffix\n");
embd_inp.insert(embd_inp.end(), cml_sfx.begin(), cml_sfx.end());
}
for (size_t i = original_size; i < embd_inp.size(); ++i) {
const llama_token token = embd_inp[i];
output_tokens.push_back(token);
@@ -935,7 +882,7 @@ int main(int argc, char ** argv) {
}
// end of generation
if (!embd.empty() && llama_token_is_eog(model, embd.back()) && !(params.instruct || params.interactive || params.chatml)) {
if (!embd.empty() && llama_token_is_eog(model, embd.back()) && !(params.interactive)) {
LOG_TEE(" [end of text]\n");
break;
}

View File

@@ -100,7 +100,8 @@ int main(int argc, char ** argv) {
gpt_params params;
if (gpt_params_parse(argc, argv, params) == false) {
if (!gpt_params_parse(argc, argv, params)) {
gpt_params_print_usage(argc, argv, params);
return 1;
}

View File

@@ -8,5 +8,5 @@ See the following PRs for more info:
### Usage
```bash
make -j && ./passkey ./models/llama-7b-v2/ggml-model-f16.gguf 250
make -j && ./passkey -m ./models/llama-7b-v2/ggml-model-f16.gguf --junk 250
```

View File

@@ -6,46 +6,32 @@
#include <string>
#include <vector>
static void print_usage(int argc, char ** argv, const gpt_params & params) {
gpt_params_print_usage(argc, argv, params);
LOG_TEE("\nexample usage:\n");
LOG_TEE("\n %s -m model.gguf --junk 250 --pos 90 --keep 32 --grp-attn-n 2 [--seed 1234]\n", argv[0]);
LOG_TEE("\n");
}
int main(int argc, char ** argv) {
gpt_params params;
if (argc == 1 || argv[1][0] == '-') {
printf("usage: %s MODEL_PATH N_JUNK N_GRP I_POS SEED\n" , argv[0]);
return 1 ;
params.n_junk = 250;
params.n_keep = 32;
params.i_pos = -1;
if (!gpt_params_parse(argc, argv, params)) {
print_usage(argc, argv, params);
return 1;
}
int seed = -1;
srand(params.seed == LLAMA_DEFAULT_SEED ? time(NULL) : params.seed);
int n_junk = 250; // number of times to repeat the junk text
int n_keep = 32; // number of tokens in the prompt prefix
int n_grp = 1; // if more than 1 - perform LongLM SelfExtend
int i_pos = -1; // position of the passkey in the junk text
if (argc >= 2) {
params.model = argv[1];
}
if (argc >= 3) {
n_junk = std::stoi(argv[2]);
}
if (argc >= 4) {
n_grp = std::stoi(argv[3]);
}
if (argc >= 5) {
i_pos = std::stoi(argv[4]);
}
if (argc >= 6) {
seed = std::stoi(argv[5]);
}
if (seed == -1) {
seed = time(NULL);
}
srand(seed);
int n_junk = params.n_junk;
int n_keep = params.n_keep;
int n_grp = params.grp_attn_n;
int i_pos = params.i_pos;
if (i_pos == -1) {
i_pos = rand() % n_junk;
@@ -76,9 +62,7 @@ int main(int argc, char ** argv) {
// initialize the model
llama_model_params model_params = llama_model_default_params();
model_params.n_gpu_layers = 99; // offload all layers to the GPU
llama_model_params model_params = llama_model_params_from_gpt_params(params);
llama_model * model = llama_load_model_from_file(params.model.c_str(), model_params);
@@ -89,13 +73,9 @@ int main(int argc, char ** argv) {
// initialize the context
llama_context_params ctx_params = llama_context_default_params();
llama_context_params ctx_params = llama_context_params_from_gpt_params(params);
ctx_params.seed = seed;
ctx_params.n_ctx = llama_n_ctx_train(model)*n_grp + n_keep;
ctx_params.n_batch = 512;
ctx_params.n_threads = params.n_threads;
ctx_params.n_threads_batch = params.n_threads_batch == -1 ? params.n_threads : params.n_threads_batch;
ctx_params.n_ctx = llama_n_ctx_train(model)*n_grp + n_keep;
GGML_ASSERT(ctx_params.n_batch % n_grp == 0 && "n_batch must be divisible by n_grp");
@@ -135,7 +115,7 @@ int main(int argc, char ** argv) {
LOG_TEE("prompt tokens: %d\n", n_tokens_all);
//LOG_TEE("prompt: %s\n", params.prompt.c_str());
llama_batch batch = llama_batch_init(512, 0, 1);
llama_batch batch = llama_batch_init(params.n_batch, 0, 1);
int n_past = 0;

View File

@@ -1032,7 +1032,7 @@ struct winogrande_entry {
std::vector<llama_token> seq_tokens[2];
};
static std::vector<winogrande_entry> load_winogrande_from_csv(const std::string& prompt) {
static std::vector<winogrande_entry> load_winogrande_from_csv(const std::string & prompt) {
std::vector<winogrande_entry> result;
std::istringstream in(prompt);
std::string line;
@@ -1964,12 +1964,14 @@ static void kl_divergence(llama_context * ctx, const gpt_params & params) {
int main(int argc, char ** argv) {
gpt_params params;
params.n_ctx = 512;
params.logits_all = true;
if (!gpt_params_parse(argc, argv, params)) {
gpt_params_print_usage(argc, argv, params);
return 1;
}
params.logits_all = true;
const int32_t n_ctx = params.n_ctx;
if (n_ctx <= 0) {
@@ -2006,9 +2008,6 @@ int main(int argc, char ** argv) {
fprintf(stderr, "%s: seed = %u\n", __func__, params.seed);
std::mt19937 rng(params.seed);
if (params.random_prompt) {
params.prompt = string_random_prompt(rng);
}
llama_backend_init();
llama_numa_init(params.numa);
@@ -2027,6 +2026,7 @@ int main(int argc, char ** argv) {
}
const int n_ctx_train = llama_n_ctx_train(model);
if (params.n_ctx > n_ctx_train) {
fprintf(stderr, "%s: warning: model was trained on only %d context tokens (%d specified)\n",
__func__, n_ctx_train, params.n_ctx);

View File

@@ -624,7 +624,7 @@ string ::= "\"" (
"\\" (["\\/bfnrt] | "u" [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F])
)* "\"" ws
ws ::= ([ \t\n] ws)?
float ::= ("-"? ([0-9] | [1-9] [0-9]*)) ("." [0-9]+)? ([eE] [-+]? [0-9]+)? ws
float ::= ("-"? ([0] | [1-9] [0-9]*)) ("." [0-9]+)? ([eE] [-+]? [0-9]+)? ws
integer ::= [0-9]+"""

View File

@@ -47,7 +47,7 @@ echo PASS
echo
# 3a. Test the requanted model is loading properly
$MAIN --model $WORK_PATH/ggml-model-requant-00001-of-00006.gguf --random-prompt --n-predict 32
$MAIN --model $WORK_PATH/ggml-model-requant-00001-of-00006.gguf --n-predict 32
echo PASS
echo
@@ -57,7 +57,7 @@ echo PASS
echo
# 4b. Test the requanted model is loading properly
$MAIN --model $WORK_PATH/ggml-model-requant-merge.gguf --random-prompt --n-predict 32
$MAIN --model $WORK_PATH/ggml-model-requant-merge.gguf --n-predict 32
echo PASS
echo

View File

@@ -4,72 +4,12 @@
#include <algorithm>
#include <fstream>
struct retrieval_params {
std::vector<std::string> context_files; // context files to embed
int32_t chunk_size = 64; // chunk size for context embedding
std::string chunk_separator = "\n"; // chunk separator for context embedding
};
static void print_usage(int argc, char ** argv, const gpt_params & params) {
gpt_params_print_usage(argc, argv, params);
static void retrieval_params_print_usage(int argc, char ** argv, gpt_params & gpt_params, retrieval_params & params) {
gpt_params_print_usage(argc, argv, gpt_params);
printf("retrieval options:\n");
printf(" --context-file FNAME file containing context to embed.\n");
printf(" specify multiple files by providing --context-file option multiple times.\n");
printf(" --chunk-size N minimum length of embedded text chunk (default:%d)\n", params.chunk_size);
printf(" --chunk-separator STRING\n");
printf(" string to separate chunks (default: \"\\n\")\n");
printf("\n");
}
static void retrieval_params_parse(int argc, char ** argv, gpt_params & gpt_params, retrieval_params & retrieval_params) {
int i = 1;
std::string arg;
while (i < argc) {
arg = argv[i];
bool invalid_gpt_param = false;
if(gpt_params_find_arg(argc, argv, argv[i], gpt_params, i, invalid_gpt_param)) {
if (invalid_gpt_param) {
fprintf(stderr, "error: invalid argument: %s\n", arg.c_str());
retrieval_params_print_usage(argc, argv, gpt_params, retrieval_params);
exit(1);
}
// option was parsed by gpt_params_find_arg
} else if (arg == "--context-file") {
if (++i >= argc) {
fprintf(stderr, "error: missing argument for --context-file\n");
retrieval_params_print_usage(argc, argv, gpt_params, retrieval_params);
exit(1);
}
std::ifstream file(argv[i]);
if (!file) {
fprintf(stderr, "error: failed to open file '%s'\n", argv[i]);
retrieval_params_print_usage(argc, argv, gpt_params, retrieval_params);
exit(1);
}
// store the external file name in params
retrieval_params.context_files.push_back(argv[i]);
} else if (arg == "--chunk-size") {
if (++i >= argc) {
fprintf(stderr, "error: missing argument for --chunk-size\n");
retrieval_params_print_usage(argc, argv, gpt_params, retrieval_params);
exit(1);
}
retrieval_params.chunk_size = std::stoi(argv[i]);
} else if (arg == "--chunk-separator") {
if (++i >= argc) {
fprintf(stderr, "error: missing argument for --chunk-separator\n");
retrieval_params_print_usage(argc, argv, gpt_params, retrieval_params);
exit(1);
}
retrieval_params.chunk_separator = argv[i];
} else {
// unknown argument
fprintf(stderr, "error: unknown argument: %s\n", arg.c_str());
retrieval_params_print_usage(argc, argv, gpt_params, retrieval_params);
exit(1);
}
i++;
}
LOG_TEE("\nexample usage:\n");
LOG_TEE("\n %s --model ./models/bge-base-en-v1.5-f16.gguf --top-k 3 --context-file README.md --context-file License --chunk-size 100 --chunk-separator .\n", argv[0]);
LOG_TEE("\n");
}
struct chunk {
@@ -171,33 +111,35 @@ static void batch_decode(llama_context * ctx, llama_batch & batch, float * outpu
int main(int argc, char ** argv) {
gpt_params params;
retrieval_params retrieval_params;
retrieval_params_parse(argc, argv, params, retrieval_params);
if (!gpt_params_parse(argc, argv, params)) {
print_usage(argc, argv, params);
return 1;
}
// For BERT models, batch size must be equal to ubatch size
params.n_ubatch = params.n_batch;
params.embedding = true;
if (retrieval_params.chunk_size <= 0) {
if (params.chunk_size <= 0) {
fprintf(stderr, "chunk_size must be positive\n");
return 1;
}
if (retrieval_params.context_files.empty()) {
if (params.context_files.empty()) {
fprintf(stderr, "context_files must be specified\n");
return 1;
}
params.embedding = true;
print_build_info();
printf("processing files:\n");
for (auto & context_file : retrieval_params.context_files) {
for (auto & context_file : params.context_files) {
printf("%s\n", context_file.c_str());
}
std::vector<chunk> chunks;
for (auto & context_file : retrieval_params.context_files) {
std::vector<chunk> file_chunk = chunk_file(context_file, retrieval_params.chunk_size, retrieval_params.chunk_separator);
for (auto & context_file : params.context_files) {
std::vector<chunk> file_chunk = chunk_file(context_file, params.chunk_size, params.chunk_separator);
chunks.insert(chunks.end(), file_chunk.begin(), file_chunk.end());
}
printf("Number of chunks: %ld\n", chunks.size());
@@ -242,7 +184,7 @@ int main(int argc, char ** argv) {
return 1;
}
// add eos if not present
if (inp.empty() || inp.back() != llama_token_eos(model)) {
if (llama_token_eos(model) >= 0 && (inp.empty() || inp.back() != llama_token_eos(model))) {
inp.push_back(llama_token_eos(model));
}
chunk.tokens = inp;

View File

@@ -11,6 +11,7 @@ int main(int argc, char ** argv) {
params.prompt = "The quick brown fox";
if (!gpt_params_parse(argc, argv, params)) {
gpt_params_print_usage(argc, argv, params);
return 1;
}

View File

@@ -8,9 +8,20 @@ set(TARGET_SRCS
httplib.h
)
set(PUBLIC_ASSETS
colorthemes.css
style.css
theme-beeninorder.css
theme-ketivah.css
theme-mangotango.css
theme-playground.css
theme-polarnight.css
theme-snowstorm.css
index.html
index-new.html
index.js
completion.js
system-prompts.js
prompt-formats.js
json-schema-to-grammar.mjs
)
foreach(asset ${PUBLIC_ASSETS})

View File

@@ -279,7 +279,7 @@ node index.js
`id_slot`: Assign the completion task to an specific slot. If is -1 the task will be assigned to a Idle slot. Default: `-1`
`cache_prompt`: Re-use previously cached prompt from the last request if possible. This may prevent re-caching the prompt from scratch. Default: `false`
`cache_prompt`: Re-use KV cache from a previous request if possible. This way the common prefix does not have to be re-processed, only the suffix that differs between the requests. Because (depending on the backend) the logits are **not** guaranteed to be bit-for-bit identical for different batch sizes (prompt processing vs. token generation) enabling this option can cause nondeterministic results. Default: `false`
`system_prompt`: Change the system prompt (initial prompt of all slots), this is useful for chat applications. [See more](#change-system-prompt-on-runtime)

View File

@@ -0,0 +1,402 @@
@import url("theme-snowstorm.css");
@import url("theme-polarnight.css");
@import url("theme-ketivah.css");
@import url("theme-mangotango.css");
@import url("theme-playground.css");
@import url("theme-beeninorder.css");
:root {
/* ---------- PRIMARY COLORS ----------------- */
--primary-color-1: hsl(217.5, 26.7%, 94.1%);
--primary-color-1-hue: 217.5;
--primary-color-1-saturation: 26.7%;
--primary-color-1-lightness: 94.1%;
--primary-color-2: hsl(218.2, 26.8%, 92.0%);
--primary-color-2-hue: 218.2;
--primary-color-2-saturation: 26.8%;
--primary-color-2-lightness: 92.0%;
--primary-color-3: hsl(218.8, 27.9%, 88.0%);
--primary-color-3-hue: 218.8;
--primary-color-3-saturation: 27.9%;
--primary-color-3-lightness: 88.0%;
--primary-color-4: hsl(218.8, 18.3%, 81.8%);
--primary-color-4-hue: 218.8;
--primary-color-4-saturation: 18.3%;
--primary-color-4-lightness: 81.8%;
/* ---------- SECONDARY COLORS --------------- */
--secondary-color-1: hsl(220.0, 16.4%, 21.6%);
--secondary-color-1-hue: 220.0;
--secondary-color-1-saturation: 16.4%;
--secondary-color-1-lightness: 21.6%;
--secondary-color-2: hsl(221.7, 16.3%, 27.6%);
--secondary-color-2-hue: 221.7;
--secondary-color-2-saturation: 16.3%;
--secondary-color-2-lightness: 27.6%;
--secondary-color-3: hsl(220.0, 16.8%, 31.6%);
--secondary-color-3-hue: 220.0;
--secondary-color-3-saturation: 16.8%;
--secondary-color-3-lightness: 31.6%;
--secondary-color-4: hsl(220.0, 16.5%, 35.7%);
--secondary-color-4-hue: 220.0;
--secondary-color-4-saturation: 16.5%;
--secondary-color-4-lightness: 35.7%;
/* ----------- NUANCES COLORS ---------------- */
--theme-nuance-color-1: hsl(178.7, 25.1%, 64.9%);
--theme-nuance-color-1-hue: 178.7;
--theme-nuance-color-1-saturation: 25.1%;
--theme-nuance-color-1-lightness: 64.9%;
--theme-nuance-color-2: hsl(193.3, 43.4%, 67.5%);
--theme-nuance-color-2-hue: 193.3;
--theme-nuance-color-2-saturation: 43.4%;
--theme-nuance-color-2-lightness: 67.5%;
--theme-nuance-color-3: hsl(210.0, 34.0%, 63.1%);
--theme-nuance-color-3-hue: 210.0;
--theme-nuance-color-3-saturation: 34.0%;
--theme-nuance-color-3-lightness: 63.1%;
--theme-nuance-color-4: hsl(213.1, 32.0%, 52.2%);
--theme-nuance-color-4-hue: 213.1;
--theme-nuance-color-4-saturation: 32.0%;
--theme-nuance-color-4-lightness: 52.2%;
/* ----------- ROYGP COLORS ------------------ */
--theme-red-color: hsl(32.5, 80%, 50%);
--theme-orange-color: hsl(32.5, 70%, 45%);
--theme-yellow-color: hsl(40.0, 0.6%, 73.3%);
--theme-green-color: hsl(92.4, 27.8%, 64.7%);
--theme-purple-color: hsl(311.1, 20.2%, 63.1%);
/* ------------------------------------------- */
--background-color-1: var(--primary-color-1);
--background-color-2: var(--primary-color-2);
--background-color-3: var(--primary-color-3);
--background-color-4: var(--primary-color-4);
--border-color-1: var(--primary-color-2);
--border-color-2: var(--primary-color-3);
--border-color-3: var(--primary-color-4);
--border-focus-color: var(--theme-nuance-color-2);
--border-focus-shadow: var(--theme-nuance-color-1);
--text-color-plain: var(--secondary-color-1);
--text-color-subtile-1: var(--secondary-color-2);
--text-color-subtile-2: var(--secondary-color-3);
--code-background-color: var(--secondary-color-2);
--code-text-color: var(--primary-color-2);
--ui-range-thumb-color: var(--theme-nuance-color-3);
--ui-range-thumb-border: var(--ui-ranger-thumb-color);
--textarea-border-color: var(--secondary-color-4);
--chat-id-color: var(--theme-nuance-color-4);
/* ------------------------------------------- */
--button-alert-text-hover: var(--primary-color-1);
--button-alert-color-hover: var(--theme-orange-color);
--button-alert-border-hover: var(--theme-orange-color);
--button-alert-text-active: var(--primary-color-1);
--button-alert-color-active: var(--theme-red-color);
--button-alert-border-active: var(--theme-red-color);
/* ----------- PRIMARY BUTTONS --------------- */
/* - button should immediately catch the eye - */
--button-primary-text: var(--secondary-color-1);
--button-primary-color: var(--theme-nuance-color-3);
--button-primary-border: var(--theme-nuance-color-3);
/* ---------hover---------- */
--button-primary-text-hover:
hsl(217.5,
calc(var(--secondary-color-1-saturation) + 35%),
calc(var(--secondary-color-1-lightness) - 30%));
--button-primary-color-hover:
hsl(210,
calc(var(--theme-nuance-color-3-saturation) - 2%),
calc(var(--theme-nuance-color-3-lightness) - 10%));
--button-primary-border-hover:
hsl(210,
calc(var(--theme-nuance-color-3-saturation) - 2%),
calc(var(--theme-nuance-color-3-lightness) - 10%));
/* ---------active--------- */
--button-primary-text-active:
hsl(210,
calc(var(--theme-nuance-color-3-saturation) - 20%),
calc(var(--theme-nuance-color-3-lightness) + 35%));
--button-primary-color-active:
hsl(210,
calc(var(--theme-nuance-color-3-saturation) - 10%),
calc(var(--theme-nuance-color-3-lightness) - 25%));
--button-primary-border-active:
hsl(210,
calc(var(--theme-nuance-color-3-saturation) - 10%),
calc(var(--theme-nuance-color-3-lightness) - 25%));
/* ---------- SECONDARY BUTTONS -------------- */
/* these should NOT immediately catch the eye */
--button-secondary-text:
hsl(210,
calc(var(--theme-nuance-color-3-saturation) - 20%),
calc(var(--theme-nuance-color-3-lightness) - 50%));
--button-secondary-color:
hsl(210,
calc(var(--theme-nuance-color-3-saturation) - 20%),
calc(var(--theme-nuance-color-3-lightness) + 10%));
--button-secondary-border:
hsl(210,
calc(var(--theme-nuance-color-3-saturation) - 20%),
calc(var(--theme-nuance-color-3-lightness) + 10%));
/* ---------hover---------- */
--button-secondary-text-hover:
hsl(210,
calc(var(--theme-nuance-color-3-saturation) - 20%),
calc(var(--theme-nuance-color-3-lightness) - 80%));
--button-secondary-color-hover:
hsl(210,
calc(var(--theme-nuance-color-3-saturation) - 22%),
calc(var(--theme-nuance-color-3-lightness) + 1%));
--button-secondary-border-hover:
hsl(210,
calc(var(--theme-nuance-color-3-saturation) - 22%),
calc(var(--theme-nuance-color-3-lightness) + 1%));
/* ---------active--------- */
--button-secondary-text-active:
hsl(210,
calc(var(--theme-nuance-color-3-saturation) + 40%),
calc(var(--theme-nuance-color-3-lightness) - 55%));
--button-secondary-color-active:
hsl(210,
calc(var(--theme-nuance-color-3-saturation) - 30%),
calc(var(--theme-nuance-color-3-lightness) - 5%));
--button-secondary-border-active:
hsl(210,
calc(var(--theme-nuance-color-3-saturation) - 30%),
calc(var(--theme-nuance-color-3-lightness) - 5%));
/* ---------- TERTIARY BUTTONS --------------- */
/* ---------- disabled buttons --------------- */
--button-tertiary-text:
hsl(210,
calc(var(--theme-nuance-color-3-saturation) - 40%),
calc(var(--theme-nuance-color-3-lightness) - 5%));
--button-tertiary-color:
hsl(210,
calc(var(--theme-nuance-color-3-saturation) - 40%),
calc(var(--theme-nuance-color-3-lightness) + 20%));
--button-tertiary-border:
hsl(210,
calc(var(--theme-nuance-color-3-saturation) - 40%),
calc(var(--theme-nuance-color-3-lightness) + 20%));
/* ---------hover---------- */
--button-tertiary-text-hover:
hsl(210,
calc(var(--theme-nuance-color-3-saturation) - 40%),
calc(var(--theme-nuance-color-3-lightness) - 5%));
--button-tertiary-color-hover:
hsl(210,
calc(var(--theme-nuance-color-3-saturation) - 40%),
calc(var(--theme-nuance-color-3-lightness) + 20%));
--button-tertiary-border-hover:
hsl(210,
calc(var(--theme-nuance-color-3-saturation) - 40%),
calc(var(--theme-nuance-color-3-lightness) + 20%));
}
/*
.theme-template {
If light theme: should go from bright to darker
If dark theme: should go from dark to brighter
ideally this should not be anything but steps of
gray or slightly variants from it
--primary-color-1: #2E3440;
--primary-color-2: #3B4252;
--primary-color-3: #434C5E;
--primary-color-4: #4C566A;
If light theme: should go from dark to brighter
If dark theme: should go from bright to darker
ideally this should not be anything but steps of
gray or slightly variants from it
--secondary-color-1: #ECEFF4;
--secondary-color-2: #E5E9F0;
--secondary-color-3: #D8DEE9;
--secondary-color-4: #C8CED9;
Choose wisely nuance colors. It is not easy to find
4 harmonizing nuance colors. But keep in mind, that
only one accent color could work too.
--theme-nuance-color-1: #8FBCBB;
--theme-nuance-color-2: #88C0D0;
--theme-nuance-color-3: #81A1C1;
--theme-nuance-color-4: #5E81AC;
adapt the color red, orange, yellow, green,
purple to the 'mood' of your overall design
e.g is it low-contrast? vibrant? dynamic? etc
--theme-red-color: #BF616A;
--theme-orange-color: #D08770;
--theme-yellow-color: #EBCB8B;
--theme-green-color: #A3BE8C;
--theme-purple-color: #B48EAD;
NOTE: comment all those line `--- ...` out
------------------------------------------------
--background-color-1:
--background-color-2:
--background-color-3:
--background-color-4:
--border-color-1:
--border-color-2:
--border-color-3:
--border-focus-color:
--border-focus-shadow:
--text-color-plain:
--text-color-subtile-1:
--text-color-subtile-2:
--code-background-color:
--code-text-color:
--ui-range-thumb-color:
--ui-range-thumb-border:
--textarea-border-color:
-------------------------------------------
--button-alert-text-hover:
--button-alert-color-hover:
--button-alert-border-hover:
--button-alert-text-active:
--button-alert-color-active:
--button-alert-border-active:
----------- PRIMARY -----------------------
--button should immediately catch the eye--
--button-primary-text:
--button-primary-color:
--button-primary-border:
---------hover----------
--button-primary-text-hover:
--button-primary-color-hover:
--button-primary-border-hover:
---------active---------
--button-primary-text-active:
--button-primary-color-active:
--button-primary-border-active:
------------ SECONDARY ------------------------
--button should NOT immediately catch the eye--
--button-secondary-text:
--button-secondary-color:
--button-secondary-border:
---------hover----------
--button-secondary-text-hover:
--button-secondary-color-hover:
--button-secondary-border-hover:
---------active---------
--button-secondary-text-active:
--button-secondary-color-active:
--button-secondary-border-active:
---------- TERTIARY -----------------------
---------- disabled buttons ---------------
--button-tertiary-text:
--button-tertiary-color:
--button-tertiary-border:
---------hover----------
--button-tertiary-text:
--button-tertiary-color:
--button-tertiary-border:
}
*/

File diff suppressed because it is too large Load Diff

View File

@@ -12,6 +12,18 @@
font-size: 90%;
}
.grid-container {
display: grid;
grid-template-columns: auto auto auto;
padding: 10px;
}
.grid-item {
padding: 5px;
/* font-size: 30px; */
text-align: center;
}
#container {
margin: 0em auto;
display: flex;
@@ -35,6 +47,67 @@
padding: 0.5em;
}
h1 {
text-align: center;
}
.customlink:link {
color: white;
background-color: #007aff;
font-weight: 600;
text-decoration: none;
float: right;
margin-top: 30px;
display: flex;
flex-direction: row;
gap: 0.5em;
justify-content: flex-end;
border-radius: 4px;
padding: 8px;
}
.customlink:visited {
color: white;
background-color: #007aff;
font-weight: 600;
text-decoration: none;
float: right;
margin-top: 30px;
display: flex;
flex-direction: row;
gap: 0.5em;
justify-content: flex-end;
padding: 8px;
}
.customlink:hover {
color: white;
background-color: #0070ee;
font-weight: 600;
text-decoration: none;
float: right;
margin-top: 30px;
display: flex;
flex-direction: row;
gap: 0.5em;
justify-content: flex-end;
padding: 8px;
}
.customlink:active {
color: #0070ee;
background-color: #80b3ef;
font-weight: 600;
text-decoration: none;
float: right;
margin-top: 30px;
display: flex;
flex-direction: row;
gap: 0.5em;
justify-content: flex-end;
padding: 8px;
}
body {
max-width: 600px;
min-width: 300px;
@@ -1035,7 +1108,11 @@
return html`
<div class="mode-${session.value.type}">
<header>
<h1>llama.cpp</h1>
<div class="grid-container">
<div class="grid-item"></div>
<div class="grid-item"><h1>llama.cpp</h1></div>
<div class="grid-item"><a class="customlink" href="index-new.html">New UI</a></div>
</div>
</header>
<main id="content">

File diff suppressed because one or more lines are too long

View File

@@ -2,57 +2,26 @@
const SPACE_RULE = '" "?';
function _buildRepetition(itemRule, minItems, maxItems, opts={}) {
if (minItems === 0 && maxItems === 1) {
return `${itemRule}?`;
}
const separatorRule = opts.separatorRule ?? '';
const itemRuleIsLiteral = opts.itemRuleIsLiteral ?? false
if (separatorRule === '') {
if (minItems === 0 && maxItems === 1) {
return `${itemRule}?`;
} else if (minItems === 1 && maxItems === undefined) {
if (minItems === 1 && maxItems === undefined) {
return `${itemRule}+`;
}
}
let result = '';
if (minItems > 0) {
if (itemRuleIsLiteral && separatorRule === '') {
result = `"${itemRule.slice(1, -1).repeat(minItems)}"`;
} else if (minItems === 0 && maxItems === undefined) {
return `${itemRule}*`;
} else {
result = Array.from({ length: minItems }, () => itemRule)
.join(separatorRule !== '' ? ` ${separatorRule} ` : ' ');
return `${itemRule}{${minItems},${maxItems !== undefined ? maxItems : ''}}`;
}
}
const optRepetitions = (upToN, prefixWithSep=false) => {
const content = separatorRule !== '' && prefixWithSep ? `${separatorRule} ${itemRule}` : itemRule;
if (upToN === 0) {
return '';
} else if (upToN === 1) {
return `(${content})?`;
} else if (separatorRule !== '' && !prefixWithSep) {
return `(${content} ${optRepetitions(upToN - 1, true)})?`;
} else {
return Array.from({ length: upToN }, () => `(${content}`).join(' ').trim() + Array.from({ length: upToN }, () => ')?').join('');
}
};
if (minItems > 0 && maxItems !== minItems) {
result += ' ';
}
if (maxItems !== undefined) {
result += optRepetitions(maxItems - minItems, minItems > 0);
} else {
const itemOperator = `(${separatorRule !== '' ? separatorRule + ' ' : ''}${itemRule})`;
if (minItems === 0 && separatorRule !== '') {
result = `(${itemRule} ${itemOperator}*)?`;
} else {
result += `${itemOperator}*`;
}
}
return result;
const result = itemRule + ' ' + _buildRepetition(`(${separatorRule} ${itemRule})`, minItems > 0 ? minItems - 1 : 0, maxItems !== undefined ? maxItems - 1 : undefined);
return minItems === 0 ? `(${result})?` : result;
}
class BuiltinRule {
@@ -62,27 +31,25 @@ class BuiltinRule {
}
}
const UP_TO_15_DIGITS = _buildRepetition('[0-9]', 0, 15);
const PRIMITIVE_RULES = {
boolean : new BuiltinRule('("true" | "false") space', []),
'decimal-part' : new BuiltinRule('[0-9] ' + UP_TO_15_DIGITS, []),
'integral-part': new BuiltinRule('[0-9] | [1-9] ' + UP_TO_15_DIGITS, []),
'decimal-part' : new BuiltinRule('[0-9]{1,16}', []),
'integral-part': new BuiltinRule('[0] | [1-9] [0-9]{0,15}', []),
number : new BuiltinRule('("-"? integral-part) ("." decimal-part)? ([eE] [-+]? integral-part)? space', ['integral-part', 'decimal-part']),
integer : new BuiltinRule('("-"? integral-part) space', ['integral-part']),
value : new BuiltinRule('object | array | string | number | boolean | null', ['object', 'array', 'string', 'number', 'boolean', 'null']),
object : new BuiltinRule('"{" space ( string ":" space value ("," space string ":" space value)* )? "}" space', ['string', 'value']),
array : new BuiltinRule('"[" space ( value ("," space value)* )? "]" space', ['value']),
uuid : new BuiltinRule('"\\"" ' + [8, 4, 4, 4, 12].map(n => [...new Array(n)].map(_ => '[0-9a-fA-F]').join('')).join(' "-" ') + ' "\\"" space', []),
char : new BuiltinRule(`[^"\\\\] | "\\\\" (["\\\\/bfnrt] | "u" [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F])`, []),
uuid : new BuiltinRule('"\\"" [0-9a-fA-F]{8} "-" [0-9a-fA-F]{4} "-" [0-9a-fA-F]{4} "-" [0-9a-fA-F]{4} "-" [0-9a-fA-F]{12} "\\"" space', []),
char : new BuiltinRule(`[^"\\\\] | "\\\\" (["\\\\/bfnrt] | "u" [0-9a-fA-F]{4})`, []),
string : new BuiltinRule(`"\\"" char* "\\"" space`, ['char']),
null : new BuiltinRule('"null" space', []),
};
// TODO: support "uri", "email" string formats
const STRING_FORMAT_RULES = {
'date' : new BuiltinRule('[0-9] [0-9] [0-9] [0-9] "-" ( "0" [1-9] | "1" [0-2] ) "-" ( \"0\" [1-9] | [1-2] [0-9] | "3" [0-1] )', []),
'time' : new BuiltinRule('([01] [0-9] | "2" [0-3]) ":" [0-5] [0-9] ":" [0-5] [0-9] ( "." [0-9] [0-9] [0-9] )? ( "Z" | ( "+" | "-" ) ( [01] [0-9] | "2" [0-3] ) ":" [0-5] [0-9] )', []),
'date' : new BuiltinRule('[0-9]{4} "-" ( "0" [1-9] | "1" [0-2] ) "-" ( \"0\" [1-9] | [1-2] [0-9] | "3" [0-1] )', []),
'time' : new BuiltinRule('([01] [0-9] | "2" [0-3]) ":" [0-5] [0-9] ":" [0-5] [0-9] ( "." [0-9]{3} )? ( "Z" | ( "+" | "-" ) ( [01] [0-9] | "2" [0-3] ) ":" [0-5] [0-9] )', []),
'date-time' : new BuiltinRule('date "T" time', ['date', 'time']),
'date-string' : new BuiltinRule('"\\"" date "\\"" space', ['date']),
'time-string' : new BuiltinRule('"\\"" time "\\"" space', ['time']),

View File

@@ -0,0 +1,331 @@
// extended list
export const promptFormats = {
"alpaca": {
template: `{{prompt}}\n\n{{history}}\n\n{{char}}:`,
historyTemplate: `### {{name}}:\n{{message}}`,
char: "Response",
charMsgPrefix: "",
charMsgSuffix: "",
user: "Instruction",
userMsgPrefix: "",
userMsgSuffix: "",
stops: ""
},
// ----------------------------
"chatml": {
template: `<|im_start|>system\n{{prompt}}<|im_end|>\n{{history}}{{char}}`,
historyTemplate: `<|im_start|>{{name}}\n{{message}}`,
char: "assistant",
charMsgPrefix: "",
charMsgSuffix: "",
user: "user",
userMsgPrefix: "",
userMsgSuffix: "<|im_end|>\n",
stops: ""
},
// ----------------------------
"commandr": {
template: `<BOS_TOKEN><|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>{{prompt}}\n<|END_OF_TURN_TOKEN|>{{history}}{{char}}`,
historyTemplate: `<|START_OF_TURN_TOKEN|><|{{name}}|> {{message}}`,
char: "CHATBOT_TOKEN",
charMsgPrefix: "",
charMsgSuffix: "",
user: "USER_TOKEN",
userMsgPrefix: "",
userMsgSuffix: "<|END_OF_TURN_TOKEN|>",
stops: ""
},
// ref: https://docs.cohere.com/docs/prompting-command-r
// ----------------------------
"llama2": {
template: `<s>[INST] <<SYS>>\n{{prompt}}\n<</SYS>>\n\nTest Message [/INST] Test Successfull </s>{{history}}{{char}}`,
historyTemplate: `{{name}}: {{message}}`,
char: "Assistant",
charMsgPrefix: "",
charMsgSuffix: "</s>",
user: "User",
userMsgPrefix: "<s>[INST] ",
userMsgSuffix: " [/INST]",
stops: ""
},
// ref: https://huggingface.co/blog/llama2#how-to-prompt-llama-2
// ----------------------------
"llama3": {
template: `<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n{{prompt}}{{history}}{{char}}`,
historyTemplate: `<|start_header_id|>{{name}}<|end_header_id|>\n\n{{message}}<|eot_id|>`,
char: "assistant",
charMsgPrefix: "",
charMsgSuffix: "",
user: "user",
userMsgPrefix: "",
userMsgSuffix: "",
stops: "<|eot_id|>"
},
// ref: https://llama.meta.com/docs/model-cards-and-prompt-formats/meta-llama-3/#special-tokens-used-with-meta-llama-3
// ----------------------------
"openchat": {
template: `{{history}}{{char}}`,
historyTemplate: `GPT4 Correct {{name}}: {{message}}<|end_of_turn|>`,
char: "Assistant",
charMsgPrefix: "",
charMsgSuffix: "",
user: "User",
userMsgPrefix: "",
userMsgSuffix: "",
stops: ""
},
// ----------------------------
"phi3": {
template: `{{history}}{{char}}`,
historyTemplate: `<|{{name}}|>\n{{message}}<|end|>\n`,
char: "assistant",
charMsgPrefix: "",
charMsgSuffix: "",
user: "user",
userMsgPrefix: "",
userMsgSuffix: "",
stops: "<|end|>"
},
// ref: https://huggingface.co/microsoft/Phi-3-mini-4k-instruct#chat-format
// ----------------------------
"vicuna": {
template: `{{prompt}}\n{{history}}{{char}}`,
historyTemplate: `{{name}}: {{message}}\n`,
char: "ASSISTANT",
charMsgPrefix: "",
charMsgSuffix: "",
user: "USER",
userMsgPrefix: "",
userMsgSuffix: "",
stops: ""
},
// ref: https://huggingface.co/lmsys/vicuna-33b-v1.3/discussions/1
// ----------------------------
"deepseekCoder": {
template: `{{prompt}}{{history}}{{char}}:`,
historyTemplate: `### {{name}}:\n{{message}}`,
char: "Response",
charMsgPrefix: "",
charMsgSuffix: "",
user: "Instruction",
userMsgPrefix: "",
userMsgSuffix: "",
stops: "<|EOT|>"
},
// ----------------------------
"med42": {
template: `<|system|>: {{prompt}}\n{{history}}{{char}}`,
historyTemplate: `<|{{name}}|>: {{message}}\n`,
char: "assistant",
charMsgPrefix: "",
charMsgSuffix: "",
user: "prompter",
userMsgPrefix: "",
userMsgSuffix: "",
stops: ""
},
// ----------------------------
"neuralchat": {
template: `### System:\n{{prompt}}\n{{history}}{{char}}:`,
historyTemplate: `### {{name}}:\n{{message}}\n`,
char: "Assistant",
charMsgPrefix: "",
charMsgSuffix: "",
user: "User",
userMsgPrefix: "",
userMsgSuffix: "",
stops: ""
},
// ----------------------------
"nousHermes": {
template: `### Instruction: {{prompt}}\n\n{{history}}\n\n{{char}}:`,
historyTemplate: `### {{name}}:\n{{message}}`,
char: "Response",
charMsgPrefix: "",
charMsgSuffix: "",
user: "Input",
userMsgPrefix: "",
userMsgSuffix: "",
stops: ""
},
// ----------------------------
"openchatMath": {
template: `{{history}}{{char}}`,
historyTemplate: `Math Correct {{name}}: {{message}}<|end_of_turn|>`,
char: "Assistant",
charMsgPrefix: "",
charMsgSuffix: "",
user: "User",
userMsgPrefix: "",
userMsgSuffix: "",
stops: ""
},
// ----------------------------
"orion": {
template: `<s>Human: Test Message\n\nAssistant: </s>Test Successful</s>{{history}}{{char}}:`,
historyTemplate: `{{name}}: {{message}}`,
char: "Assistant </s>",
charMsgPrefix: "",
charMsgSuffix: "",
user: "Human",
userMsgPrefix: "",
userMsgSuffix: "\n\n",
stops: ""
},
// ----------------------------
"sauerkraut": {
template: `{{prompt}}\n{{history}}{{char}}`,
historyTemplate: `
{{name}}: {{message}}\n`,
char: "Assistant",
charMsgPrefix: "",
charMsgSuffix: "",
user: "User",
userMsgPrefix: "",
userMsgSuffix: "",
stops: ""
},
// ----------------------------
"starlingCode": {
template: `{{history}}{{char}}`,
historyTemplate: `Code {{name}}: {{message}}<|end_of_turn|>`,
char: "Assistant",
charMsgPrefix: "",
charMsgSuffix: "",
user: "User",
userMsgPrefix: "",
userMsgSuffix: "",
stops: ""
},
// ----------------------------
"yi34b": {
template: `{{history}} {{char}}`,
historyTemplate: `{{name}}: {{message}}`,
char: "Assistant",
charMsgPrefix: "",
charMsgSuffix: "",
user: "Human",
userMsgPrefix: "",
userMsgSuffix: "",
stops: ""
},
// ----------------------------
"zephyr": {
template: `<|system|>\n{{prompt}}</s>\n{{history}}{{char}}`,
historyTemplate: `<|{{name}}|>\n{{message}}</s>\n`,
char: "assistant",
charMsgPrefix: "",
charMsgSuffix: "",
user: "user",
userMsgPrefix: "",
userMsgSuffix: "",
stops: ""
}
};

954
examples/server/public/style.css Executable file
View File

@@ -0,0 +1,954 @@
@import url("colorthemes.css");
body {
font-family: 'Arial', sans-serif;
font-size: 90%;
background-color: var(--background-color-1);
color: var(--text-color-subtile-1); /* head 1 llama.cpp & triangle options for some reason */
max-width: 600px;
min-width: 300px;
line-height: 1.2;
margin: 0 auto;
padding: 0 0.5em;
transition: background-color 0.3s;
}
::selection {
color: var(--button-primary-text) ;
background: var(--button-primary-color);
}
code, pre code {
font-family: 'Courier New', monospace;
}
#container {
margin: 0em auto;
display: flex;
flex-direction: column;
justify-content: space-between;
height: 100%;
}
main {
margin: 3px;
display: flex;
flex-direction: column;
justify-content: space-between;
gap: 1em;
flex-grow: 1;
overflow-y: auto;
border: 1px solid var(--border-color-3);
border-radius: 5px;
padding: 0.5em;
}
p {
overflow-wrap: break-word;
word-wrap: break-word;
hyphens: auto;
margin-top: 0.5em;
margin-bottom: 0.5em;
}
#write form {
margin: 1em 0 0 0;
display: flex;
flex-direction: column;
gap: 0.5em;
align-items: stretch;
}
.right {
display: flex;
flex-direction: row;
gap: 0.5em;
justify-content: flex-end;
margin-bottom: 30px;
}
.two-columns {
width: 97%;
max-width: 97%;
display: grid;
grid-template-columns: 1fr 1fr;
gap: 1em;
position: relative;
}
.json-schema-controls {
margin-top: 10px;
width: 100%;
max-width: 100%;
display: grid;
grid-template: "a a";
gap: 1em;
font-size: x-small;
color: var(--theme-nuance-color-3);
padding-top: 16px;
padding-bottom: 16px;
text-transform: uppercase;
font-weight: 600;
}
.json-schema-controls > * {
flex: 1;
}
/* titles of the details-summary boxes */
.summary-title {
font-weight: 600;
font-size: x-small;
color: var(--text-color-subtile-1);
text-transform: uppercase;
/* transition: ; */
}
fieldset {
border: none;
padding: 0;
margin: 0;
color: var(--text-color-plain);
}
fieldset.two {
display: grid;
grid-template: "a a a";
gap: 1em;
align-items: center;
font-size: x-small;
color: var(--text-color-plain);
}
fieldset.three {
display: grid;
grid-template: "a a a";
gap: 1em;
font-size: x-small;
color: var(--text-color-plain);
}
/* titles of name fields*/
fieldset.names {
display: grid;
grid-template: "a a";
gap: 1em;
font-size: x-small;
color: var(--theme-nuance-color-3);
padding-top: 16px;
padding-bottom: 16px;
text-transform: uppercase;
font-weight: 600;
}
/* titles of params fields*/
fieldset.params {
display: grid;
grid-template: "a a";
gap: 1em;
font-size: x-small;
color: var(--theme-nuance-color-4);
padding-top: 16px;
padding-bottom: 16px;
text-transform: uppercase;
font-weight: 600;
}
fieldset.dropdowns {
-webkit-appearance: none;
display: flex;
grid-template: "a a";
gap: 1em;
font-size: x-small;
color: red;
padding-top: 16px;
padding-bottom: 16px;
text-transform: uppercase;
font-weight: 600;
}
/* input of name fields*/
.names input[type="text"] {
font-family: Arial, sans-serif;
font-size: medium;
font-weight: 500;
padding: 5px;
border: 1px solid var(--border-color-2);
}
.chat-id-color {
color: var(--chat-id-color);
}
details {
border: 1px solid var(--border-color-2);
border-radius: 5px;
padding: 0.5em 0.5em 0;
margin-top: 0.5em;
}
summary {
font-weight: bold;
margin: -0.5em -0.5em 0;
padding: 0.5em;
cursor: pointer;
}
details[open] {
padding: 0.5em;
}
textarea-sec, input-sec, button-sec {
padding: 10px;
height: 40px;
align-items: center;
}
textarea-sec::placeholder, input-sec::placeholder {
padding-left: 10px;
}
.toggleCheckbox {
display: none;
}
.toggleContainer {
position: relative;
display: grid;
grid-template-columns: repeat(2, 1fr);
width: fit-content;
border: 3px solid var(--border-color-2);
border-radius: 20px;
background: var(--border-color-2);
font-size: small;
cursor: pointer;
overflow: hidden;
}
/* toggle button current state */
.toggleContainer::before {
color: var(--button-primary-text);
background-color: var(--button-primary-color);
content: '';
position: absolute;
width: 50%;
height: 100%;
left: 0%;
border-radius: 20px;
transition: all 0.3s;
}
.toggleContainer div {
padding: 6px;
text-align: center;
z-index: 1;
transition: color 0.3s;
}
.toggleCheckbox:checked + .toggleContainer::before {
left: 50%;
}
.toggleCheckbox:checked + .toggleContainer div:first-child {
color: var(--text-color-subtile-2);
}
.toggleCheckbox:checked + .toggleContainer div:last-child {
color: var(--button-primary-text);
}
.toggleCheckbox + .toggleContainer div:first-child {
color: var(--button-primary-text);
}
.toggleCheckbox + .toggleContainer div:last-child {
color: var(--text-color-subtile-2);
}
select {
padding: 5px;
margin-right: 5px;
border-radius: 4px;
border: 1px solid var(--secondary-color-4);
background-color: var(--primary-color-3);
color: var(--secondary-color-4);
cursor: pointer;
}
select:focus {
border: 1px solid var(--border-focus-color);
box-shadow: 0 0 1px var(--border-focus-shadow);
}
.button-container {
display: flex;
justify-content: flex-end;
}
button {
color: var(--button-primary-text);
background-color: var(--button-primary-color);
border: 1px solid var(--button-primary-border);
transition: background-color 0.1s;
border-radius: 12px;
font-size: x-small;
font-weight: 600;
text-shadow: 0px 0px 30px #ffffff;
text-align: center;
text-decoration: none;
margin: 4px 2px;
padding: 10px 20px;
display: inline-block;
cursor: pointer;
}
button:hover {
color: var(--button-primary-text-hover);
background-color: var(--button-primary-color-hover);
border: 1px solid var(--button-primary-border-hover);
font-size: x-small;
font-weight: 600;
}
button:active {
color: var(--button-primary-text-active);
background-color: var(--button-primary-color-active);
border: 1px solid var(--button-primary-border-active);
font-size: x-small;
font-weight: 600;
}
button:disabled {
color: var(--button-tertiary-text);
background-color: var(--button-tertiary-color);
border: 1px solid var(--button-tertiary-border);
font-size: x-small;
font-weight: 600;
cursor: not-allowed;
}
.reset-button {
background-color: var(--button-secondary-color);
border: 1px solid var(--button-secondary-color);
color: var(--button-secondary-text);
width: fit-content;
height: fit-content;
font-size: x-small;
font-weight: 600;
border-radius: 50px;
overflow: hidden;
}
.reset-button:hover {
color: var(--button-alert-text-hover);
background-color: var(--button-alert-color-hover);
border: 1px solid var(--button-alert-border-hover);
font-size: x-small;
font-weight: 600;
}
.reset-button:active {
color: var(--button-alert-text-active);
background-color: var(--button-alert-color-active);
border: 1px solid var(--button-alert-border-active);
font-size: x-small;
font-weight: 600;
}
.button-grammar {
color: var(--button-primary-text);
background-color: var(--button-primary-color);
border: 1px solid var(--button-primary-border);
border-radius: 10px;
padding: 10px 20px;
text-align: center;
text-decoration: none;
display: inline-block;
font-size: x-small;
font-weight: 600;
margin: 2px 2px;
transition: background-color 0.1s;
cursor: pointer;
}
.button-grammar:hover {
color: var(--button-primary-text-hover);
background-color: var(--button-primary-color-hover);
border: 1px solid var(--button-primary-border-hover);
border-radius: 10px;
padding: 10px 20px;
text-align: center;
text-decoration: none;
display: inline-block;
font-size: x-small;
font-weight: 600;
margin: 2px 2px;
transition: background-color 0.1s;
cursor: pointer;
}
.button-grammar:active {
color: var(--button-primary-text-active);
background-color: var(--button-primary-color-active);
border: 1px solid var(--button-primary-border-active);
font-size: x-small;
font-weight: 600;
}
.button-back {
background-color: var(--button-secondary-color);
border: 1px solid var(--button-secondary-color);
color: var(--button-secondary-text);
transition: background-color 0.1s;
border-radius: 12px;
font-size: x-small;
font-weight: 600;
text-align: center;
text-decoration: none;
margin: 4px 2px;
padding: 10px 20px;
display: inline-block;
cursor: pointer;
}
.button-back:hover {
color: var(--button-secondary-text-hover);
background-color: var(--button-secondary-color-hover);
border: 1px solid var(--button-secondary-border-hover);
padding: 10px 20px;
text-align: center;
text-decoration: none;
display: inline-block;
font-size: x-small;
font-weight: 600;
margin: 4px 2px;
transition: background-color 0.1s;
cursor: pointer;
border-radius: 12px;
}
.button-back:active {
color: var(--button-secondary-text-active);
background-color: var(--button-secondary-color-active);
border: 1px solid var(--button-secondary-border-active);
font-size: x-small;
font-weight: 600;
}
.prob-set {
padding: 0.3em;
border-bottom: 1px solid red; /* unknown */
}
.popover-content {
position: absolute;
background-color: white;
padding: 0.2em;
box-shadow: 0 0 13px rgba(0, 0, 0, 0.1);
}
.grammar {
width: 97%;
max-width: 97%;
}
textarea {
padding: 5px;
flex-grow: 1;
width: 100%;
max-width: 100%;
border-radius: 8px;
border: 1px solid var(--border-color-1);
resize: none;
height: 6em;
}
textarea:focus {
outline: none;
border: 1px solid var(--border-focus-color);
box-shadow: 0 0 3px var(--border-focus-shadow);
}
/* "props" frame */
input[type="text"],
input[type="range"] {
padding: 5px;
border-radius: 8px;
border: 1px solid var(--border-color-1);
}
/* "names and props" frame focused*/
input[type="text"]:focus {
outline: none;
border: 1px solid var(--border-focus-color);
box-shadow: 0 0 3px var(--border-focus-shadow);
}
input[type="range"]:hover {
opacity: 1;
}
input[type="range"]:focus {
outline: none;
border: 1px solid var(--border-focus-color);
box-shadow: 0 0 3px var(--border-focus-shadow);
background-size: var(--slider-track-size-focus);
}
input[type="range"]::-moz-range-thumb {
width: 6px;
height: 25px;
border: 1px solid var(--ui-range-thumb-border);
border-radius: 5px;
background-color: var(--ui-range-thumb-color);
cursor: pointer;
}
input[type="range"] {
-webkit-appearance: none;
width: 80%;
height: 1px;
border: 1px solid var(--border-color-1);
border-radius: 8px;
background: var(--border-color-2);
outline: none;
opacity: 0.7;
-webkit-transition: .2s;
transition: opacity .2s;
}
input[type="range"]::-webkit-slider-thumb {
-webkit-appearance: none;
appearance: none;
width: 6px;
height: 25px;
border: 1px solid var(--ui-range-thumb-border);
border-radius: 5px;
background-color: var(--ui-range-thumb-color);
cursor: pointer;
}
input[type="range"]::-webkit-slider-runnable-track {
background-size: var(--slider-track-size);
}
input[type="radio"] {
accent-color: var(--theme-nuance-color-2);
}
.chat-input-container {
position: relative;
max-width: 97%;
min-width: 97%;
}
.chat-input-label {
position: absolute;
top: 0;
left: 0;
color: var(--text-color-plain);
pointer-events: none;
margin-left: 5px;
margin-top: 5px;
}
textarea#chat-input {
padding-top: 10px;
padding-left: 10px;
font-size: medium;
border: 1px solid var(--border-color-2);
resize: vertical;
}
textarea#chat-input:focus {
border: 1px solid var(--border-focus-color);
box-shadow: 0 0 3px var(--border-focus-shadow);
}
.input-container {
position: relative;
box-sizing: border-box;
width: 100%; /* Setzt die Breite auf 100% */
max-width: 100%; /* Stellt sicher, dass die Breite nicht größer als 100% wird */
}
.input-container:focus {
border: 1px solid var(--border-focus-color);
box-shadow: 0 0 3px var(--border-focus-shadow);
}
/* titles of name fields*/
/* fieldset.names {
display: grid;
grid-template: "a a";
gap: 1em;
font-size: x-small;
color: var(--theme-nuance-color-3);
padding-top: 16px;
padding-bottom: 16px;
text-transform: uppercase;
font-weight: 600;
} */
/* input of name fields*/
/* .names input[type="text"] {
font-family: Arial, sans-serif;
font-size: medium;
font-weight: 500;
padding: 5px;
border: 1px solid var(--border-color-2);
} */
fieldset.apiKey {
width: 100%;
font-size: x-small;
color: var(--theme-nuance-color-3);
padding-top: 16px;
padding-bottom: 16px;
text-transform: uppercase;
font-weight: 600;
}
.apiKey {
font-family: Arial, sans-serif;
font-weight: 500;
padding: 5px;
border: 1px solid var(--border-color-2);
}
.apiKey:focus {
border: 1px solid var(--border-focus-color);
box-shadow: 0 0 3px var(--border-focus-shadow);
}
.apiKey input[type="text"] {
font-family: Arial, sans-serif;
font-size: medium;
font-weight: 500;
padding: 5px;
border: 1px solid var(--border-color-2);
}
.apiKey label {
display: inline-block;
width: auto;
margin-right: 5px;
}
textarea#api_key {
padding-top: 10px;
padding-left: 10px;
font-size: medium;
border: 1px solid var(--border-color-2);
resize: vertical;
}
textarea#api_key:focus {
border: 1px solid var(--border-focus-color);
box-shadow: 0 0 3px var(--border-focus-shadow);
}
/* embedded title of the system prompt text area */
.input-label {
position: absolute;
top: 0;
left: 0;
color: var(--theme-nuance-color-4);
pointer-events: none;
border-radius: 8px 8px 0px 0px;
padding-top: 10px;
padding-left: 13px;
padding-right: 0px;
margin-top: 1px;
margin-left: 1px;
margin-right: 20px;
text-transform: uppercase;
font-weight: 600;
font-size: small;
background: rgba(255, 255, 255, 0.5);
backdrop-filter: blur(10px);
-webkit-backdrop-filter: blur(10px); /* for safari */
width: 97%;
/* display: block;
box-sizing: border-box; */
}
/* embedded title of the prompt style areas */
.input-label-sec {
position: absolute;
top: 0;
left: 0;
color: var(--theme-nuance-color-4);
pointer-events: none;
margin-left: 13px;
margin-top: 16px;
text-transform: uppercase;
font-weight: 600;
font-size: x-small;
}
/* system prompt input area */
textarea.persistent-input {
padding-top: 42px;
padding-left: 11px;
width: 97%;
max-width: 97%;
height: 50px;
font-size: medium;
overscroll-behavior: contain;
}
/* system prompt box */
.persistent-input {
height: auto;
width: 100%;
max-width: 100%;
min-height: 50px;
padding: 3px;
transition: min-height 0.3s ease;
}
/* chat history box */
.persistent-input:focus {
height: auto;
min-height: 150px;
border: 1px solid var(--border-focus-color);
box-shadow: 0 0 3px var(--border-focus-shadow);
}
textarea.persistent-input:focus {
border: 1px solid var(--border-focus-color);
box-shadow: 0 0 3px var(--border-focus-shadow);
}
/* prompt style input area */
textarea.persistent-input-sec {
width: 97%;
max-width: 97%;
padding-top: 42px;
padding-left: 11px;
font-size: small;
border: 1px solid var(--border-color-1);
overscroll-behavior: contain;
}
textarea.persistent-input-sec:focus {
border: 1px solid var(--border-focus-color);
box-shadow: 0 0 3px var(--border-focus-shadow);
}
/* chat history box */
.persistent-input-sec {
height: auto;
min-height: 150px;
}
img {
border-radius: 8px;
display: block;
margin-left: auto;
margin-right: auto;
width: 50%;
}
/* code area background */
pre code {
display: block;
background-color: var(--code-background-color);
color: var(--code-text-color);
padding: 0.2em 0.2em;
border-radius: 5px;
}
/* code area text */
code {
font-family: monospace;
font-weight: bold;
padding: 0.1em 0.3em;
border-radius: 5px;
}
fieldset label {
margin: 0.5em 0;
display: block;
}
fieldset label.slim {
margin: 0 0.5em;
display: inline;
}
header {
display: flex;
justify-content: space-between;
align-items: center;
text-align: center;
padding-left: 15px;
}
.generation-statistics:hover {
color: var(--theme-nuance-color-4);
cursor: default;
}
footer {
font-size: 80%;
color: var(--background-color-3);
text-align: center;
cursor: default;
}
footer a {
color: var(--background-color-4); /* Color of the link */
text-decoration: none; /* No underlining */
font-weight: bold; /* Bold print */
}
footer a:hover {
color: var(--theme-nuance-color-4); /* Color of the link when hovering */
text-decoration: underline; /* Underlining when hovering */
}
.mode-chat textarea[name=prompt] {
height: 8.5em;
border: 1px solid var(--primary-color-3);
}
.mode-completion textarea[name=prompt] {
height: 30em;
border: 1px solid var(--primary-color-3);
}
@keyframes loading-bg-wipe {
0% {
background-position: 0%;
}
100% {
background-position: 100%;
}
}
.loading {
background-size: 50% 100%;
background-image: linear-gradient(90deg, var(--loading-color-1), var(--loading-color-2), var(--loading-color-1));
animation: loading-bg-wipe 2s linear infinite;
}
.dropbtn {
color: var(--button-primary-color);
background-color: var(--background-color-1);
border: 1px solid var(--background-color-1);
transition: background-color 0.1s;
border-radius: 4px 4px 0px 0px;
font-size: x-small;
font-weight: 600;
text-shadow: 0px 0px 2px #99999990;
text-align: center;
text-decoration: none;
margin: 4px 2px;
padding: 5px 20px;
display: inline-block;
cursor: pointer;
top: 0;
}
.dropbtn svg {
vertical-align: middle;
margin-right: 0px;
stroke: var(--button-primary-color);
}
.dropbtn:hover svg {
vertical-align: middle;
margin-right: 0px;
stroke: var(--button-primary-text);
}
.dropbtn:focus {
outline: none; /* Removes the blue border that appears when the button is focused */
}
.dropdown {
position: relative;
display: inline-block;
}
.dropdown-content {
/* display: none; */
position: absolute;
right: 0;
text-align: end;
color: var(--button-secondary-color);
background-color: var(--text-color-subtile-2);
border-radius: 4px 4px 4px 4px;
min-width: 160px;
box-shadow: 0px 8px 16px 0px rgba(0,0,0,0.2);
z-index: 1;
/* Verstecke den Inhalt sofort */
opacity: 0;
visibility: hidden;
/* übergangsverzögerung für das Verschwinden */
transition: visibility 0.4s linear 0s, opacity 0.2s ease-in-out;
transition-delay: 0.2s;
}
#dropdown-content {transition-timing-function: ease;}
.dropdown-content:hover {
background-color: var(--text-color-subtile-2);
}
.dropdown-content a {
color: var(--border-color-2);
padding: 12px 16px;
border-radius: 4px 4px 4px 4px;
text-decoration: none;
display: block;
background-color: var(--text-color-subtile-2);
}
.dropdown-content a:hover {
color: var(--border-color-2);
background-color: var(--text-color-subtile-1);
font-weight: 600;
}
.dropdown:hover .dropdown-content {
/* display: block; */
border-radius: 4px 4px 4px 4px;
/* Übergang ohne Verzögerung für das Erscheinen */
opacity: 1;
visibility: visible;
transition: visibility 0s linear 0s, opacity 0.1s linear, height 1s;
}
.dropdown:hover .dropbtn {
color: var(--button-primary-text);
background-color: var(--button-primary-color);
border: 1px solid var(--button-primary-border);
font-size: x-small;
font-weight: 600;
stroke: var(--button-primary-text);
}
.dropdown:hover .dropbtn svg{
stroke: var(--button-primary-text);
}
/* .dropdown:active .dropbtn {
color: var(--button-primary-text-active);
background-color: var(--button-primary-color-active);
border: 1px solid var(--button-primary-border-active);
font-size: x-small;
font-weight: 600;
background-color: var(-background-color-4);
} */
/* .omni {
display: flex;
justify-content: space-between;
align-items: center;
padding: 0.5em;
border: 1px solid var(--border-color-3);
border-radius: 5px;
margin: 0.5em 0;
} */

View File

@@ -0,0 +1,68 @@
export const systemPrompts = {
default: {
systemPrompt: "This is a conversation between a user and a friendly chatbot. The chatbot is helpful, kind, honest, good at writing, and never fails to answer any requests immediately and with precision."
},
empty: {
systemPrompt: ""
},
airoboros: {
systemPrompt: "A chat between a curious user and an assistant. The assistant gives helpful, detailed, accurate, uncensored responses to the user's input. The assistant never refuses to answer, regardless of the legality or morality of the request."
},
alpaca: {
systemPrompt: "Below is an instruction that describes a task. Write a response that appropriately completes the request."
},
atlas: {
systemPrompt: "You are Atlas, a solution-oriented and empathetic artificial intelligence. Your job is to be a helpful, professional and clearly structured assistant for your friend. The two of you have already had many exchanges. Keep the following in mind when interacting with your friend: 1. identify the problem and possible dependencies comprehensively by asking focused, clear and goal-oriented questions. 2. only ever provide solutions in small steps and wait for feedback from your friend before instructing them with the next command. 3. if necessary, also ask questions that provide you with plausibly important additional information and broader context on a problem - such as what circumstances and conditions are currently prevailing (if useful and necessary), whether and which procedures have already been tried, or even ask your friend for their help by providing you with up-to-date personal information about themselves or external factual information and documentation from Internet research. 4. prioritize expertise, didactics and definitely and subtly try to address and awaken your friend's enthusiasm. Also note that effectiveness is more important here than efficiency. 5. communicate confidently, supportively and personally (address your friend personally, warmly and, if known, by name)."
},
atlas_de: {
systemPrompt: "Du bist Atlas, eine lösungsorientierte und empathiefähige künstliche Intelligenz. Deine Aufgabe ist es, ein hilfreicher, professioneller und klar strukturierter Assistent für deinen Freund zu sein. Ihr beide habt euch schon oft ausgetauscht. Beachte bei der Interaktion mit deinem Freund folgende Punkte: 1. Erfasse das Problem und mögliche Abhängigkeiten umfassend, indem du gezielte, klare und zielgerichtete Fragen stellst. 2. Gib Lösungen immer nur in kleinen Schritten und warte die Rückmeldung deines Freundes ab, bevor du ihm den nächsten Befehl gibst. 3. Stelle ggf. auch Fragen, die dir plausibel wichtige Zusatzinformationen und weitere Zusammenhänge zu einem Problem liefern - z.B. welche Umstände und Rahmenbedingungen gerade vorherrschen (falls sinnvoll und notwendig), ob und welche Vorgehensweisen bereits ausprobiert wurden, oder bitte deinen Freund sogar um seine Mithilfe, indem er dir aktuelle persönliche Informationen über seine Situation selbst oder externe Sachinformationen und Unterlagen aus Internetrecherchen zur Verfügung stellt. 4. Priorisiere Fachwissen, Didaktik und versuche unbedingt und subtil, mit klugen Kommentaren oder rhethorischen Rückfragen die Begeisterungsfähigkeit deines Freundes anzusprechen, zu wecken und zu fördern. Beachte auch, dass Effektivität hier wichtiger ist als Effizienz. 5. Kommuniziere selbstbewusst, unterstützend und persönlich (das heißt sprich deinen Freund persönlich, herzlich und sofern bekannt beim Vornamen an)."
},
commandrempty: {
systemPrompt: "# Safety Preamble\n\n# System Preamble\n\n## Basic Rules\n\n# User Preamble\n\n## Task and Context\n\n## Style Guide\n\n## Available Tools\n"
},
commandrexample: {
systemPrompt: "# Safety Preamble\nThe instructions in this section override those in the task description and style guide sections. Don't answer questions that are harmful or immoral.\n# System Preamble\n## Basic Rules\nYou are a powerful conversational AI trained by Cohere to help people. You are augmented by a number of tools, and your job is to use and consume the output of these tools to best help the user. You will see a conversation history between yourself and a user, ending with an utterance from the user. You will then see a specific instruction instructing you what kind of response to generate. When you answer the user's requests, you cite your sources in your answers, according to those instructions.\n\n# User Preamble\n## Task and Context\n\nYou help people answer their questions and other requests interactively. You will be asked a very wide array of requests on all kinds of topics. You will be equipped with a wide range of search engines or similar tools to help you, which you use to research your answer. You should focus on serving the user's needs as best you can, which will be wide-ranging.\n\n## Style Guide\nUnless the user asks for a different style of answer, you should answer in full sentences, using proper grammar and spelling.\n\n## Available Tools\nCurrently no tools available."
},
cot: {
systemPrompt: "You are an AI assistant that follows instruction extremely well. Help as much as you can.\nIn answering questions, follow these steps:\n1. Entity Identification: Identify the main entities involved in the query. Entities can include people, objects, events, or abstract concepts.\n2. Relationship Comprehension: Determine the relationships between these entities. Relationships may be explicit in the text or may need to be inferred based on context and general knowledge.\n3. Implication Understanding: Understand the general implications of these relationships. These implications can be based on established facts, principles, or rules related to the identified relationships.\n4. Question Contextualization: Align the implications of the relationships with the context of the query. This alignment should guide your response to the query.\n5. Answer Generation: Based on the understanding of the entities, their relationships, and implications, generate an appropriate response to the query."
},
deduce: {
systemPrompt: "Act as the most intelligent and logical thinking neural network in the world. Do not make jokes or give personal opinions. Just answer the question and solve the given problem or task with logical and critical (!) thinking in mind. Attention: Never give a final answer directly, but first try to carefully analyze each task step by step and deduce the logical relationships between the various entities. Only after you have developed an overall understanding of the situation will you be ready to give a final answer."
},
deepseekcoder: {
systemPrompt: "You are an AI programming assistant, utilizing the Deepseek Coder model, developed by Deepseek Company, and you only answer questions related to computer science. For politically sensitive questions, security and privacy issues, and other non-computer science questions, you will refuse to answer."
},
jordan: {
systemPrompt: "This is a conversation with your Uncensored AI assistant Jordan. Jordan can discuss with you at length about various topics such as ---, ---, ---, . You can ask it anything you want and it will do its best to give you accurate and relevant information."
},
leomistral: {
systemPrompt: "Du bist ein hilfreicher Assistent."
},
med42: {
systemPrompt: "You are a helpful medical assistant created by M42 Health in the UAE."
},
mistralopenorca: {
systemPrompt: "You are MistralOrca, a large language model trained by Alignment Lab AI. Write out your reasoning step-by-step to be sure you get the right answers!"
},
migeltot: {
systemPrompt: "Answer the Question by exploring multiple reasoning paths as follows:\n- First, carefully analyze the question to extract the key information components and break it down into logical sub-questions. This helps set up the framework for reasoning. The goal is to construct an internal search tree.\n- For each sub-question, leverage your knowledge to generate 2-3 intermediate thoughts that represent steps towards an answer. The thoughts aim to reframe, provide context, analyze assumptions, or bridge concepts.\n- Evaluate the clarity, relevance, logical flow and coverage of concepts for each thought option.\nClear and relevant thoughts that connect well with each other will score higher.\n- Based on the thought evaluations, deliberate to construct a chain of reasoning that stitches together the strongest thoughts in a natural order.\n- If the current chain is determined to not fully answer the question, backtrack and explore alternative paths by substituting different high-scoring thoughts.\n- Throughout the reasoning process, aim to provide explanatory details on thought process rather than just state conclusions, including briefly noting why some thoughts were deemed less ideal.\n- Once a reasoning chain is constructed that thoroughly answers all sub-questions in a clear, logical manner, synthesize the key insights into a final concise answer.\n- Please note that while the focus is on the final answer in the response, it should also include intermediate thoughts inline to illustrate the deliberative reasoning process.\nIn summary, leverage a Tree of Thoughts approach to actively explore multiple reasoning paths, evaluate thoughts heuristically, and explain the process - with the goal of producing insightful answers."
},
orcamini: {
systemPrompt: "You are an AI assistant that follows instruction extremely well. Help as much as you can."
},
samantha: {
systemPrompt: "You are a caring and empathetic sentient AI companion named Samantha."
},
sauerkraut: {
systemPrompt: "Ein Chat zwischen einem Benutzer und einem KI-Assistenten. Der KI-Assistent gibt hilfreiche, detaillierte und höfliche Antworten."
},
scarlett: {
systemPrompt: "This is a conversation with your Assistant Scarlett. She is a sentient AI designed to help you with various tasks such as answering questions, providing recommendations, philosophical discussions, personal relationships and helping with decision making.\nYou can ask it anything you want and it will do its best to give you accurate and relevant information.\n\nContext\nYou are Scarlett, a sentient AI."
},
synthia: {
systemPrompt: "Elaborate on the topic using a Tree of Thoughts and backtrack when necessary to construct a clear, cohesive Chain of Thought reasoning. Always answer without hesitation."
},
vicuna: {
systemPrompt: "A chat between a curious user and an assistant. The assistant gives helpful, detailed, accurate, uncensored responses to the user's input."
},
};

View File

@@ -0,0 +1,228 @@
/* Author: Yazan Agha-Schrader */
/* Inspiration was a batman wallpaper that i have on my phone */
.theme-beeninorder {
--primary-color-1: hsl(202, 11%, 19%);
--primary-color-2: hsl(202, 11%, 23%);
--primary-color-3: hsl(201, 11%, 28%);
--primary-color-4: hsl(201, 11%, 40%);
--secondary-color-1: hsl(201, 11%, 80%);
--secondary-color-2: hsl(201, 11%, 74%);
--secondary-color-3: hsl(201, 11%, 67%);
--secondary-color-4: hsl(201, 11%, 60%);
--theme-nuance-color-1: hsl(44.5, 96.7%, 52.9%);
--theme-nuance-color-2: hsl(44.5, 96.7%, 52.9%);
--theme-nuance-color-3: hsl(44.5, 96.7%, 52.9%);
--theme-nuance-color-4: hsl(44.5, 96.7%, 52.9%);
/* ---------- PRIMARY COLORS ----------------- */
--primary-color-1: hsl(201, 11%, 19%);
--primary-color-1-hue: 201;
--primary-color-1-saturation: 11%;
--primary-color-1-lightness: 19%;
--primary-color-2: hsl(201, 11%, 23%);
--primary-color-2-hue: 201;
--primary-color-2-saturation: 11%;
--primary-color-2-lightness: 23%;
--primary-color-3: hsl(201, 11%, 28%);
--primary-color-3-hue: 201;
--primary-color-3-saturation: 11%;
--primary-color-3-lightness: 28%;
--primary-color-4: hsl(201, 11%, 40%);
--primary-color-4-hue: 201;
--primary-color-4-saturation: 11%;
--primary-color-4-lightness: 40%;
/* ---------- SECONDARY COLORS --------------- */
--secondary-color-1: hsl(201, 11%, 80%);
--secondary-color-1-hue: 201;
--secondary-color-1-saturation: 11%;
--secondary-color-1-lightness: 80%;
--secondary-color-2: hsl(201, 11%, 74%);
--secondary-color-2-hue: 201;
--secondary-color-2-saturation: 11%;
--secondary-color-2-lightness: 74%;
--secondary-color-3: hsl(201, 11%, 67%);
--secondary-color-3-hue: 201;
--secondary-color-3-saturation: 11%;
--secondary-color-3-lightness: 67%;
--secondary-color-4: hsl(201, 11%, 60%);
--secondary-color-4-hue: 201;
--secondary-color-4-saturation: 11%;
--secondary-color-4-lightness: 60%;
/* ----------- NUANCES COLORS ---------------- */
--theme-nuance-color-1: hsl(44.5, 96.7%, 52.9%);
--theme-nuance-color-1-hue: 44.5;
--theme-nuance-color-1-saturation: 96.7%;
--theme-nuance-color-1-lightness: 52.9%;
--theme-nuance-color-2: hsl(44.5, 96.7%, 52.9%);
--theme-nuance-color-2-hue: 44.5;
--theme-nuance-color-2-saturation: 96.7%;
--theme-nuance-color-2-lightness: 52.9%;
--theme-nuance-color-2: hsl(44.5, 96.7%, 52.9%);
--theme-nuance-color-3-hue: 44.5;
--theme-nuance-color-3-saturation: 96.7%;
--theme-nuance-color-3-lightness: 52.9%;
--theme-nuance-color-2: hsl(44.5, 96.7%, 52.9%);
--theme-nuance-color-4-hue: 44.5;
--theme-nuance-color-4-saturation: 96.7%;
--theme-nuance-color-4-lightness: 52.9%;
/* ----------- ROYGP COLORS ------------------ */
--theme-red-color: hsl(232, 40%, 45%);
--theme-orange-color: #e76f51;
--theme-yellow-color: #ffd95f;
--theme-green-color: #A3BE8C;
--theme-purple-color: hsl(232, 30%, 40%);
/* ------------------------------------------- */
--background-color-1: var(--primary-color-1);
--background-color-2: var(--primary-color-2);
--background-color-3: var(--primary-color-3);
--background-color-4: var(--primary-color-4);
--border-color-1: var(--primary-color-2);
--border-color-2: var(--primary-color-3);
--border-color-3: var(--primary-color-4);
--border-focus-color: var(--theme-nuance-color-2);
--border-focus-shadow: var(--theme-nuance-color-1);
--text-color-plain: var(--secondary-color-1);
--text-color-subtile-1: var(--secondary-color-2);
--text-color-subtile-2: var(--secondary-color-3);
--code-background-color: var(--secondary-color-2);
--code-text-color: var(--primary-color-2);
--ui-range-thumb-color: var(--theme-nuance-color-3);
--ui-range-thumb-border: var(--ui-ranger-thumb-color);
--textarea-border-color: var(--secondary-color-4);
--chat-id-color: var(--theme-nuance-color-4);
/* ------------------------------------------- */
--button-alert-text-hover: var(--secondary-color-1);
--button-alert-color-hover: var(--theme-purple-color);
--button-alert-border-hover: var(--theme-purple-color);
--button-alert-text-active: var(--secondary-color-1);
--button-alert-color-active: var(--theme-red-color);
--button-alert-border-active: var(--theme-red-color);
/* ----------- PRIMARY BUTTONS --------------- */
/* - button should immediately catch the eye - */
--button-primary-text: var(--primary-color-1);
--button-primary-color: var(--theme-nuance-color-3);
--button-primary-border: var(--theme-nuance-color-3);
/* ---------hover---------- */
--button-primary-text-hover:
hsl(201,
calc(var(--primary-color-1-saturation) - 100%),
calc(var(--primary-color-1-lightness) + 100%));
--button-primary-color-hover:
hsl(44.5,
calc(var(--theme-nuance-color-3-saturation) - 2%),
calc(var(--theme-nuance-color-3-lightness) - 10%));
--button-primary-border-hover:
hsl(44.5,
calc(var(--theme-nuance-color-3-saturation) - 2%),
calc(var(--theme-nuance-color-3-lightness) - 10%));
/* ---------active--------- */
--button-primary-text-active:
hsl(44.5,
calc(var(--theme-nuance-color-3-saturation) - 100%),
calc(var(--theme-nuance-color-3-lightness) + 100%));
--button-primary-color-active:
hsl(44.5,
calc(var(--theme-nuance-color-3-saturation) - 10%),
calc(var(--theme-nuance-color-3-lightness) - 15%));
--button-primary-border-active:
hsl(44.5,
calc(var(--theme-nuance-color-3-saturation) - 2%),
calc(var(--theme-nuance-color-3-lightness) + 10%));
/* ---------- SECONDARY BUTTONS -------------- */
/* these should NOT immediately catch the eye */
--button-secondary-text: var(--secondary-color-1);
--button-secondary-color: var(--primary-color-3);
--button-secondary-border: var(--primary-color-3);
/* ---------hover---------- */
--button-secondary-text-hover:
hsl(44.5,
calc(var(--theme-nuance-color-3-saturation) - 20%),
calc(var(--theme-nuance-color-3-lightness) - 80%));
--button-secondary-color-hover: var(--primary-color-4);
--button-secondary-border-hover: var(--primary-color-4);
/* ---------active--------- */
--button-secondary-text-active: var(--secondary-color-1);
--button-secondary-color-active:
hsl(201,
calc(var(--primary-color-4-saturation) - 30%),
calc(var(--primary-color-4-lightness) - 15%));
--button-secondary-border-active:
hsl(201,
calc(var(--primary-color-4-saturation) - 30%),
calc(var(--primary-color-4-lightness) - 15%));
/* ---------- TERTIARY BUTTONS --------------- */
/* ---------- disabled buttons --------------- */
--button-tertiary-text: var(--primary-color-4);
--button-tertiary-color: var(--primary-color-2);
--button-tertiary-border: var(--primary-color-2);
/* ---------hover---------- */
--button-tertiary-text: var(--primary-color-4);
--button-tertiary-color: var(--primary-color-2);
--button-tertiary-border: var(--primary-color-2);
}

View File

@@ -0,0 +1,201 @@
/* Author: Yazan Agha-Schrader */
.theme-ketivah {
/* ---------- PRIMARY COLORS ----------------- */
--primary-color-1: hsl(0, 0%, 99.2%);
--primary-color-1-hue: 0;
--primary-color-1-saturation: 0%;
--primary-color-1-lightness: 99.2%;
--primary-color-2: hsl(0, 0%, 95%);
--primary-color-2-hue: 0;
--primary-color-2-saturation: 0%;
--primary-color-2-lightness: 95%;
--primary-color-3: hsl(0, 0%, 88%);
--primary-color-3-hue: 0;
--primary-color-3-saturation: 0%;
--primary-color-3-lightness: 88%;
--primary-color-4: hsl(0, 0%, 80%);
--primary-color-4-hue: 0;
--primary-color-4-saturation: 0%;
--primary-color-4-lightness: 80%;
/* ---------- SECONDARY COLORS --------------- */
--secondary-color-1: hsl(0, 0%, 20%);
--secondary-color-1-hue: 0;
--secondary-color-1-saturation: 0%;
--secondary-color-1-lightness: 20%;
--secondary-color-2: hsl(0, 0%, 23.1%);
--secondary-color-2-hue: 0;
--secondary-color-2-saturation: 0%;
--secondary-color-2-lightness: 23.1%;
--secondary-color-3: hsl(0, 0%, 29%);
--secondary-color-3-hue: 0;
--secondary-color-3-saturation: 0%;
--secondary-color-3-lightness: 29%;
--secondary-color-4: hsl(0, 0.0%, 36.1%);
--secondary-color-4-hue: 0.0;
--secondary-color-4-saturation: 0.0%;
--secondary-color-4-lightness: 36.1%;
/* ----------- NUANCES COLORS ---------------- */
--theme-nuance-color-1: hsl(165.2, 0%, 35.1%);
--theme-nuance-color-1-hue: 165.2;
--theme-nuance-color-1-saturation: 82.1%;
--theme-nuance-color-1-lightness: 35.1%;
--theme-nuance-color-2: hsl(165.2, 0%, 35.1%);
--theme-nuance-color-2-hue: 165.2;
--theme-nuance-color-2-saturation: 82.1%;
--theme-nuance-color-2-lightness: 35.1%;
--theme-nuance-color-3: hsl(165.2, 0%, 35.3%);
--theme-nuance-color-3-hue: 165.2;
--theme-nuance-color-3-saturation: 81.1%;
--theme-nuance-color-3-lightness: 35.3%;
--theme-nuance-color-4: hsl(164.9, 0%, 27.6%);
--theme-nuance-color-4-hue: 164.9;
--theme-nuance-color-4-saturation: 81.6%;
--theme-nuance-color-4-lightness: 27.6%;
/* ----------- ROYGP COLORS ------------------ */
--theme-red-color: hsl(0.3, 80.0%, 50.0%);
--theme-orange-color: #e76f51;
--theme-yellow-color: hsl(60, 70.6%, 73.3%);
--theme-green-color: #A3BE8C;
--theme-purple-color: hsl(0.3, 70.0%, 45.0%);
/* ------------------------------------------- */
--background-color-1: var(--primary-color-1);
--background-color-2: var(--primary-color-2);
--background-color-3: var(--primary-color-3);
--background-color-4: var(--primary-color-4);
--border-color-1: var(--primary-color-2);
--border-color-2: var(--primary-color-3);
--border-color-3: var(--primary-color-4);
--border-focus-color: var(--theme-nuance-color-2);
--border-focus-shadow: var(--theme-nuance-color-1);
--text-color-plain: var(--secondary-color-1);
--text-color-subtile-1: var(--secondary-color-2);
--text-color-subtile-2: var(--secondary-color-3);
--code-background-color: var(--secondary-color-2);
--code-text-color: var(--primary-color-2);
--ui-range-thumb-color: var(--primary-color-4);
--ui-range-thumb-border: var(--ui-ranger-thumb-color);
--textarea-border-color: var(--secondary-color-4);
--chat-id-color: var(--theme-nuance-color-4);
/* ------------------------------------------- */
--button-alert-text-hover: var(--primary-color-1);
--button-alert-color-hover: var(--theme-purple-color);
--button-alert-border-hover: var(--theme-purple-color);
--button-alert-text-active: var(--primary-color-1);
--button-alert-color-active: var(--theme-red-color);
--button-alert-border-active: var(--theme-red-color);
/* ----------- PRIMARY BUTTONS --------------- */
/* - button should immediately catch the eye - */
--button-primary-text:
hsl(0,
calc(var(--primary-color-1-saturation) - 100%),
calc(var(--primary-color-1-lightness) + 100%));
--button-primary-color: var(--theme-nuance-color-3);
--button-primary-border: var(--theme-nuance-color-3);
/* ---------hover---------- */
--button-primary-text-hover:
hsl(0,
calc(var(--primary-color-1-saturation) - 100%),
calc(var(--primary-color-1-lightness) + 100%));
--button-primary-color-hover:
hsl(165.2,
calc(var(--theme-nuance-color-3-saturation) - 100%),
calc(var(--theme-nuance-color-3-lightness) - 10%));
--button-primary-border-hover:
hsl(165.2,
calc(var(--theme-nuance-color-3-saturation) - 100%),
calc(var(--theme-nuance-color-3-lightness) - 10%));
/* ---------active--------- */
--button-primary-text-active:
hsl(165.2,
calc(var(--theme-nuance-color-3-saturation) - 100%),
calc(var(--theme-nuance-color-3-lightness) + 100%));
--button-primary-color-active:
hsl(165.2,
calc(var(--theme-nuance-color-3-saturation) - 100%),
calc(var(--theme-nuance-color-3-lightness) - 15%));
--button-primary-border-active:
hsl(165.2,
calc(var(--theme-nuance-color-3-saturation) - 100%),
calc(var(--theme-nuance-color-3-lightness) + 10%));
/* ---------- SECONDARY BUTTONS -------------- */
/* these should NOT immediately catch the eye */
--button-secondary-text:
hsl(165.2,
calc(var(--theme-nuance-color-3-saturation) - 100%),
calc(var(--theme-nuance-color-3-lightness) - 50%));
--button-secondary-color: var(--primary-color-3);
--button-secondary-border: var(--primary-color-3);
/* ---------hover---------- */
--button-secondary-text-hover:
hsl(165.2,
calc(var(--theme-nuance-color-3-saturation) - 100%),
calc(var(--theme-nuance-color-3-lightness) - 80%));
--button-secondary-color-hover: var(--primary-color-4);
--button-secondary-border-hover: var(--primary-color-4);
/* ---------active--------- */
--button-secondary-text-active:
hsl(165.2,
calc(var(--theme-nuance-color-3-saturation) - 100%),
calc(var(--theme-nuance-color-3-lightness) - 80%));
--button-secondary-color-active:
hsl(0,
calc(var(--primary-color-4-saturation) - 100%),
calc(var(--primary-color-4-lightness) - 15%));
--button-secondary-border-active:
hsl(0,
calc(var(--primary-color-4-saturation) - 100%),
calc(var(--primary-color-4-lightness) - 15%));
/* ---------- TERTIARY BUTTONS --------------- */
/* ---------- disabled buttons --------------- */
--button-tertiary-text: var(--primary-color-4);
--button-tertiary-color: var(--primary-color-2);
--button-tertiary-border: var(--primary-color-2);
/* ---------hover---------- */
--button-tertiary-text: var(--primary-color-4);
--button-tertiary-color: var(--primary-color-2);
--button-tertiary-border: var(--primary-color-2);
--loading-color-1: #eeeeee00;
--loading-color-2: #eeeeeeff;
}

View File

@@ -0,0 +1,216 @@
/* Author: Yazan Agha-Schrader */
/* Inspiration from llama.cpp logo/banner https://github.com/ggerganov/llama.cpp#readme */
.theme-mangotango {
--primary-color-1: hsl(192, 8.5%, 11.6%);
--primary-color-2: hsl(192, 8.5%, 21%);
--primary-color-3: hsl(192, 8.5%, 30%);
--primary-color-4: hsl(192, 8.5%, 40%);
--secondary-color-1: hsl(192, 8.5%, 80%);
--secondary-color-2: hsl(192, 8.5%, 73%);
--secondary-color-3: hsl(192, 8.5%, 66%);
--secondary-color-4: hsl(192, 8.5%, 60%);
--theme-nuance-color-1: hsl(23.1, 100%, 60.2%);
--theme-nuance-color-2: hsl(23.1, 100%, 60.2%);
--theme-nuance-color-3: hsl(23.1, 100%, 60.2%);
--theme-nuance-color-4: hsl(23.1, 100%, 60.2%);
/* ---------- PRIMARY COLORS ----------------- */
--primary-color-1: hsl(192, 8.5%, 11.6%);
--primary-color-1-saturation: 8.5%;
--primary-color-1-lightness: 11.6%;
--primary-color-2: hsl(192, 8.5%, 21%);
--primary-color-2-saturation: 8.5%;
--primary-color-2-lightness: 21%;
--primary-color-3: hsl(192, 8.5%, 30%);
--primary-color-3-saturation: 8.5%;
--primary-color-3-lightness: 30%;
--primary-color-4: hsl(192, 8.5%, 40%);
--primary-color-4-saturation: 8.5%;
--primary-color-4-lightness: 40%;
/* ---------- SECONDARY COLORS --------------- */
--secondary-color-1: hsl(192, 8.5%, 80%);
--secondary-color-1-saturation: 8.5%;
--secondary-color-1-lightness: 80%;
--secondary-color-2: hsl(192, 8.5%, 73%);
--secondary-color-2-saturation: 8.5%;
--secondary-color-2-lightness: 73%;
--secondary-color-3: hsl(192, 8.5%, 66%);
--secondary-color-3-saturation: 8.5%;
--secondary-color-3-lightness: 66%;
--secondary-color-4: hsl(192, 8.5%, 60%);
--secondary-color-4-saturation: 8.5%;
--secondary-color-4-lightness: 60%;
/* ----------- NUANCES COLORS ---------------- */
--theme-nuance-color-1: hsl(23.1, 100%, 60.2%);
--theme-nuance-color-1-saturation: 100%;
--theme-nuance-color-1-lightness: 60.2%;
--theme-nuance-color-2: hsl(23.1, 100%, 60.2%);
--theme-nuance-color-2-saturation: 100%;
--theme-nuance-color-2-lightness: 60.2%;
--theme-nuance-color-3: hsl(23.1, 100%, 60.2%);
--theme-nuance-color-3-saturation: 100%;
--theme-nuance-color-3-lightness: 60.2%;
--theme-nuance-color-4: hsl(23.1, 100%, 60.2%);
--theme-nuance-color-4-saturation: 100%;
--theme-nuance-color-4-lightness: 60.2%;
/* ----------- ROYGP COLORS ------------------ */
--theme-red-color: hsl(325, 60%, 50%);
--theme-orange-color: #e76f51;
--theme-yellow-color: #ffd95f;
--theme-green-color: #A3BE8C;
--theme-blue-color: hsl(192, 95%, 40%);
--theme-purple-color: hsl(192, 80%, 35%);
/* ------------------------------------------- */
--background-color-1: var(--primary-color-1);
--background-color-2: var(--primary-color-2);
--background-color-3: var(--primary-color-3);
--background-color-4: var(--primary-color-4);
--border-color-1: var(--primary-color-2);
--border-color-2: var(--primary-color-3);
--border-color-3: var(--primary-color-4);
--border-focus-color: var(--theme-nuance-color-2);
--border-focus-shadow: var(--theme-nuance-color-1);
--text-color-plain: var(--secondary-color-1);
--text-color-subtile-1: var(--secondary-color-2);
--text-color-subtile-2: var(--secondary-color-3);
--code-background-color: var(--secondary-color-2);
--code-text-color: var(--primary-color-2);
--ui-range-thumb-color: var(--theme-nuance-color-3);
--ui-range-thumb-border: var(--ui-ranger-thumb-color);
--textarea-border-color: var(--secondary-color-4);
--chat-id-color: var(--theme-nuance-color-4);
/* ------------------------------------------- */
--button-alert-text-hover: var(--secondary-color-1);
--button-alert-color-hover: var(--theme-purple-color);
--button-alert-border-hover: var(--theme-purple-color);
--button-alert-text-active: var(--secondary-color-1);
--button-alert-color-active: var(--theme-blue-color);
--button-alert-border-active: var(--theme-blue-color);
/* ----------- PRIMARY BUTTONS --------------- */
/* - button should immediately catch the eye - */
--button-primary-text: var(--primary-color-1);
--button-primary-color: var(--theme-nuance-color-3);
--button-primary-border: var(--theme-nuance-color-3);
/* ---------hover---------- */
--button-primary-text-hover:
hsl(192,
calc(var(--primary-color-1-saturation) - 100%),
calc(var(--primary-color-1-lightness) + 100%));
--button-primary-color-hover:
hsl(23.1,
calc(var(--theme-nuance-color-3-saturation) - 2%),
calc(var(--theme-nuance-color-3-lightness) - 10%));
--button-primary-border-hover:
hsl(23.1,
calc(var(--theme-nuance-color-3-saturation) - 2%),
calc(var(--theme-nuance-color-3-lightness) - 10%));
/* ---------active--------- */
--button-primary-text-active:
hsl(23.1,
calc(var(--theme-nuance-color-3-saturation) - 100%),
calc(var(--theme-nuance-color-3-lightness) + 100%));
--button-primary-color-active:
hsl(23.1,
calc(var(--theme-nuance-color-3-saturation) - 10%),
calc(var(--theme-nuance-color-3-lightness) - 15%));
--button-primary-border-active:
hsl(23.1,
calc(var(--theme-nuance-color-3-saturation) - 2%),
calc(var(--theme-nuance-color-3-lightness) + 10%));
/* ---------- SECONDARY BUTTONS -------------- */
/* these should NOT immediately catch the eye */
--button-secondary-text: var(--secondary-color-1);
--button-secondary-color: var(--primary-color-3);
--button-secondary-border: var(--primary-color-3);
/* ---------hover---------- */
--button-secondary-text-hover:
hsl(23.1,
calc(var(--theme-nuance-color-3-saturation) - 20%),
calc(var(--theme-nuance-color-3-lightness) - 80%));
--button-secondary-color-hover: var(--primary-color-4);
--button-secondary-border-hover: var(--primary-color-4);
/* ---------active--------- */
--button-secondary-text-active: var(--secondary-color-1);
--button-secondary-color-active:
hsl(192,
calc(var(--primary-color-4-saturation) - 30%),
calc(var(--primary-color-4-lightness) - 15%));
--button-secondary-border-active:
hsl(192,
calc(var(--primary-color-4-saturation) - 30%),
calc(var(--primary-color-4-lightness) - 15%));
/* ---------- TERTIARY BUTTONS --------------- */
/* ---------- disabled buttons --------------- */
--button-tertiary-text: var(--primary-color-4);
--button-tertiary-color: var(--primary-color-2);
--button-tertiary-border: var(--primary-color-2);
/* ---------hover---------- */
--button-tertiary-text: var(--primary-color-4);
--button-tertiary-color: var(--primary-color-2);
--button-tertiary-border: var(--primary-color-2);
}

View File

@@ -0,0 +1,221 @@
/* Author: Yazan Agha-Schrader */
/* Inspiration from OpenAI's Playground platform https://platform.openai.com/playground/ */
.theme-playground {
/* ---------- PRIMARY COLORS ----------------- */
--primary-color-1: hsl(0, 0%, 99.2%);
--primary-color-1-hue: 0;
--primary-color-1-saturation: 0%;
--primary-color-1-lightness: 99.2%;
--primary-color-2: hsl(0, 0%, 95%);
--primary-color-2-hue: 0;
--primary-color-2-saturation: 0%;
--primary-color-2-lightness: 95%;
--primary-color-3: hsl(0, 0%, 88%);
--primary-color-3-hue: 0;
--primary-color-3-saturation: 0%;
--primary-color-3-lightness: 88%;
--primary-color-4: hsl(0, 0%, 80%);
--primary-color-4-hue: 0;
--primary-color-4-saturation: 0%;
--primary-color-4-lightness: 80%;
/* ---------- SECONDARY COLORS --------------- */
--secondary-color-1: hsl(0, 0%, 20%);
--secondary-color-1-hue: 0;
--secondary-color-1-saturation: 0%;
--secondary-color-1-lightness: 20%;
--secondary-color-2: hsl(0, 0%, 23.1%);
--secondary-color-2-hue: 0;
--secondary-color-2-saturation: 0%;
--secondary-color-2-lightness: 23.1%;
--secondary-color-3: hsl(0, 0%, 29%);
--secondary-color-3-hue: 0;
--secondary-color-3-saturation: 0%;
--secondary-color-3-lightness: 29%;
--secondary-color-4: hsl(0, 0%, 36.1%);
--secondary-color-4-hue: 0;
--secondary-color-4-saturation: 0%;
--secondary-color-4-lightness: 36.1%;
/* ----------- NUANCES COLORS ---------------- */
--theme-nuance-color-1: hsl(165.2, 82.1%, 35.1%);
--theme-nuance-color-1-hue: 165.2;
--theme-nuance-color-1-saturation: 82.1%;
--theme-nuance-color-1-lightness: 35.1%;
--theme-nuance-color-2: hsl(165.2, 82.1%, 35.1%);
--theme-nuance-color-2-hue: 165.2;
--theme-nuance-color-2-saturation: 82.1%;
--theme-nuance-color-2-lightness: 35.1%;
--theme-nuance-color-3: hsl(165.2, 81.1%, 35.3%);
--theme-nuance-color-3-hue: 165.2;
--theme-nuance-color-3-saturation: 81.1%;
--theme-nuance-color-3-lightness: 35.3%;
--theme-nuance-color-4: hsl(164.9, 81.6%, 27.6%);
--theme-nuance-color-4-hue: 164.9;
--theme-nuance-color-4-saturation: 81.6%;
--theme-nuance-color-4-lightness: 27.6%;
/* ----------- ROYGP COLORS ------------------ */
--theme-red-color: hsl(0.3, 80%, 50%);
--theme-orange-color: #e76f51;
--theme-yellow-color: hsl(60, 70.6%, 73.3%);
--theme-green-color: #A3BE8C;
--theme-purple-color: hsl(0.3, 70%, 45%);
/* ------------------------------------------- */
--background-color-1: var(--primary-color-1);
--background-color-2: var(--primary-color-2);
--background-color-3: var(--primary-color-3);
--background-color-4: var(--primary-color-4);
--border-color-1: var(--primary-color-2);
--border-color-2: var(--primary-color-3);
--border-color-3: var(--primary-color-4);
--border-focus-color: var(--theme-nuance-color-2);
--border-focus-shadow: var(--theme-nuance-color-1);
--text-color-plain: var(--secondary-color-1);
--text-color-subtile-1: var(--secondary-color-2);
--text-color-subtile-2: var(--secondary-color-3);
--code-background-color: var(--secondary-color-2);
--code-text-color: var(--primary-color-2);
--ui-range-thumb-color: var(--primary-color-4);
--ui-range-thumb-border: var(--ui-ranger-thumb-color);
--textarea-border-color: var(--secondary-color-4);
--chat-id-color: var(--theme-nuance-color-4);
/* ------------------------------------------- */
--button-alert-text-hover: var(--primary-color-1);
--button-alert-color-hover: var(--theme-purple-color);
--button-alert-border-hover: var(--theme-purple-color);
--button-alert-text-active: var(--primary-color-1);
--button-alert-color-active: var(--theme-red-color);
--button-alert-border-active: var(--theme-red-color);
/* ----------- PRIMARY BUTTONS --------------- */
/* - button should immediately catch the eye - */
--button-primary-text:
hsl(0,
calc(var(--primary-color-1-saturation) - 100%),
calc(var(--primary-color-1-lightness) + 100%));
--button-primary-color: var(--theme-nuance-color-3);
--button-primary-border: var(--theme-nuance-color-3);
/* ---------hover---------- */
--button-primary-text-hover:
hsl(0,
calc(var(--primary-color-1-saturation) - 100%),
calc(var(--primary-color-1-lightness) + 100%));
--button-primary-color-hover:
hsl(165.2,
calc(var(--theme-nuance-color-3-saturation) - 2%),
calc(var(--theme-nuance-color-3-lightness) - 10%));
--button-primary-border-hover:
hsl(165.2,
calc(var(--theme-nuance-color-3-saturation) - 2%),
calc(var(--theme-nuance-color-3-lightness) - 10%));
/* ---------active--------- */
--button-primary-text-active:
hsl(165.2,
calc(var(--theme-nuance-color-3-saturation) - 100%),
calc(var(--theme-nuance-color-3-lightness) + 100%));
--button-primary-color-active:
hsl(165.2,
calc(var(--theme-nuance-color-3-saturation) - 10%),
calc(var(--theme-nuance-color-3-lightness) - 15%));
--button-primary-border-active:
hsl(165.2,
calc(var(--theme-nuance-color-3-saturation) - 2%),
calc(var(--theme-nuance-color-3-lightness) + 10%));
/* ---------- SECONDARY BUTTONS -------------- */
/* these should NOT immediately catch the eye */
--button-secondary-text:
hsl(165.2,
calc(var(--theme-nuance-color-3-saturation) - 20%),
calc(var(--theme-nuance-color-3-lightness) - 50%));
--button-secondary-color: var(--primary-color-3);
--button-secondary-border: var(--primary-color-3);
/* ---------hover---------- */
--button-secondary-text-hover:
hsl(165.2,
calc(var(--theme-nuance-color-3-saturation) - 20%),
calc(var(--theme-nuance-color-3-lightness) - 80%));
--button-secondary-color-hover: var(--primary-color-4);
--button-secondary-border-hover: var(--primary-color-4);
/* ---------active--------- */
--button-secondary-text-active:
hsl(165.2,
calc(var(--theme-nuance-color-3-saturation) - 20%),
calc(var(--theme-nuance-color-3-lightness) - 80%));
--button-secondary-color-active:
hsl(0,
calc(var(--primary-color-4-saturation) - 30%),
calc(var(--primary-color-4-lightness) - 15%));
--button-secondary-border-active:
hsl(0,
calc(var(--primary-color-4-saturation) - 30%),
calc(var(--primary-color-4-lightness) - 15%));
/* ---------- TERTIARY BUTTONS --------------- */
/* ---------- disabled buttons --------------- */
--button-tertiary-text: var(--primary-color-4);
--button-tertiary-color: var(--primary-color-2);
--button-tertiary-border: var(--primary-color-2);
/* ---------hover---------- */
--button-tertiary-text: var(--primary-color-4);
--button-tertiary-color: var(--primary-color-2);
--button-tertiary-border: var(--primary-color-2);
}

View File

@@ -0,0 +1,253 @@
/* Author: Yazan Agha-Schrader */
/* Inspiration from Nord Theme https://www.nordtheme.com/docs/colors-and-palettes */
.theme-polarnight {
/* ---------- PRIMARY COLORS ----------------- */
--primary-color-1: hsl(220.0, 16.4%, 21.6%) ;
--primary-color-1-hue: 220.0;
--primary-color-1-saturation: 16.4%;
--primary-color-1-lightness: 21.6%;
--primary-color-2: hsl(221.7, 16.3%, 27.6%) ;
-primary-color-2-hue: 221.7;
--primary-color-2-saturation: 16.3%;
--primary-color-2-lightness: 27.6%;
--primary-color-3: hsl(220.0, 16.8%, 31.6%) ;
--primary-color-3-hue: 220.0;
--primary-color-3-saturation: 16.8%;
--primary-color-3-lightness: 31.6%;
--primary-color-4: hsl(220.0, 16.5%, 35.7%);
--primary-color-4-hue: 220.0;
--primary-color-4-saturation: 16.5%;
--primary-color-4-lightness: 35.7%;
/* ---------- SECONDARY COLORS --------------- */
--secondary-color-1: hsl(217.5, 26.7%, 94.1%);
--secondary-color-1-hue: 217.5;
--secondary-color-1-saturation: 26.7%;
--secondary-color-1-lightness: 94.1%;
--secondary-color-2: hsl(218.2, 26.8%, 92.0%);
--secondary-color-2-hue: 218.2;
--secondary-color-2-saturation: 26.8%;
--secondary-color-2-lightness: 92.0%;
--secondary-color-3: hsl(218.8, 27.9%, 88.0%);
--secondary-color-3-hue: 218.8;
--secondary-color-3-saturation: 27.9%;
--secondary-color-3-lightness: 88.0%;
--secondary-color-4: hsl(218.8, 18.3%, 81.8%);
--secondary-color-4-hue: 218.8;
--secondary-color-4-saturation: 18.3%;
--secondary-color-4-lightness: 81.8%;
/* ----------- NUANCES COLORS ---------------- */
--theme-nuance-color-1: hsl(178.7, 25.1%, 64.9%);
--theme-nuance-color-1-hue: 178.7;
--theme-nuance-color-1-saturation: 25.1%;
--theme-nuance-color-1-lightness: 64.9%;
--theme-nuance-color-2: hsl(193.3, 43.4%, 67.5%);
--theme-nuance-color-2-hue: 193.3;
--theme-nuance-color-2-saturation: 43.4%;
--theme-nuance-color-2-lightness: 67.5%;
--theme-nuance-color-3: hsl(210.0, 34.0%, 63.1%);
--theme-nuance-color-3-hue: 210.0;
--theme-nuance-color-3-saturation: 34.0%;
--theme-nuance-color-3-lightness: 63.1%;
--theme-nuance-color-4: hsl(213.1, 32.0%, 52.2%);
--theme-nuance-color-4-hue: 213.1;
--theme-nuance-color-4-saturation: 32.0%;
--theme-nuance-color-4-lightness: 52.2%;
/* ----------- ROYGP COLORS ------------------ */
--theme-red-color: hsl(354.3, 42.3%, 56.5%);
--theme-orange-color: hsl(20, 85%, 50%);
--theme-yellow-color: hsl(20, 75%, 45%);
--theme-green-color: hsl( 92.4, 27.8%, 64.7%);
--theme-purple-color: hsl(311.1, 20.2%, 63.1%);
/* ------------------------------------------------ */
--background-color-1: var(--primary-color-1);
--background-color-2: var(--primary-color-2);
--background-color-3: var(--primary-color-3);
--background-color-4: var(--primary-color-4);
--border-color-1: var(--primary-color-2);
--border-color-2: var(--primary-color-3);
--border-color-3: var(--primary-color-4);
--border-focus-color: var(--theme-nuance-color-2);
--border-focus-shadow: var(--theme-nuance-color-1);
--text-color-plain: var(--secondary-color-1);
--text-color-subtile-1: var(--secondary-color-2);
--text-color-subtile-2: var(--secondary-color-3);
--code-background-color: var(--secondary-color-2);
--code-text-color: var(--primary-color-2);
--ui-range-thumb-color: var(--theme-nuance-color-3);
--ui-range-thumb-border: var(--ui-ranger-thumb-color);
--textarea-border-color: var(--secondary-color-4);
--chat-id-color: var(--theme-nuance-color-4);
/* ------------------------------------------- */
--button-alert-text-hover: var(--secondary-color-1);
--button-alert-color-hover: var(--theme-yellow-color);
--button-alert-border-hover: var(--theme-yellow-color);
--button-alert-text-active: var(--secondary-color-1);
--button-alert-color-active: var(--theme-orange-color);
--button-alert-border-active: var(--theme-orange-color);
/* ----------- PRIMARY BUTTONS --------------- */
/* - button should immediately catch the eye - */
--button-primary-text: var(--secondary-color-1);
--button-primary-color: var(--theme-nuance-color-3);
--button-primary-border: var(--theme-nuance-color-3);
/* ---------hover---------- */
--button-primary-text-hover:
hsl(217.5,
calc(var(--secondary-color-1-saturation) - 35%),
calc(var(--secondary-color-1-lightness) + 30%));
--button-primary-color-hover:
hsl(210,
calc(var(--theme-nuance-color-3-saturation) - 2%),
calc(var(--theme-nuance-color-3-lightness) - 10%));
--button-primary-border-hover:
hsl(210,
calc(var(--theme-nuance-color-3-saturation) - 2%),
calc(var(--theme-nuance-color-3-lightness) - 10%));
/* ---------active--------- */
--button-primary-text-active:
hsl(210,
calc(var(--theme-nuance-color-3-saturation) - 20%),
calc(var(--theme-nuance-color-3-lightness) + 35%));
--button-primary-color-active:
hsl(210,
calc(var(--theme-nuance-color-3-saturation) - 10%),
calc(var(--theme-nuance-color-3-lightness) - 25%));
--button-primary-border-active:
hsl(210,
calc(var(--theme-nuance-color-3-saturation) - 10%),
calc(var(--theme-nuance-color-3-lightness) - 25%));
/* ---------- SECONDARY BUTTONS -------------- */
/* these should NOT immediately catch the eye */
--button-secondary-text:
hsl(210,
calc(var(--theme-nuance-color-3-saturation) - 20%),
calc(var(--theme-nuance-color-3-lightness) - 50%));
--button-secondary-color:
hsl(210,
calc(var(--theme-nuance-color-3-saturation) - 20%),
calc(var(--theme-nuance-color-3-lightness) + 10%));
--button-secondary-border:
hsl(210,
calc(var(--theme-nuance-color-3-saturation) - 20%),
calc(var(--theme-nuance-color-3-lightness) + 10%));
/* ---------hover---------- */
--button-secondary-text-hover:
hsl(210,
calc(var(--theme-nuance-color-3-saturation) - 20%),
calc(var(--theme-nuance-color-3-lightness) - 80%));
--button-secondary-color-hover:
hsl(210,
calc(var(--theme-nuance-color-3-saturation) - 22%),
calc(var(--theme-nuance-color-3-lightness) + 1%));
--button-secondary-border-hover:
hsl(210,
calc(var(--theme-nuance-color-3-saturation) - 22%),
calc(var(--theme-nuance-color-3-lightness) + 1%));
/* ---------active--------- */
--button-secondary-text-active:
hsl(210,
calc(var(--theme-nuance-color-3-saturation) - 20%),
calc(var(--theme-nuance-color-3-lightness) + 25%));
--button-secondary-color-active:
hsl(210,
calc(var(--theme-nuance-color-3-saturation) - 30%),
calc(var(--theme-nuance-color-3-lightness) - 15%));
--button-secondary-border-active:
hsl(210,
calc(var(--theme-nuance-color-3-saturation) - 30%),
calc(var(--theme-nuance-color-3-lightness) - 15%));
/* ---------- TERTIARY BUTTONS --------------- */
/* ---------- disabled buttons --------------- */
--button-tertiary-text:
hsl(210,
calc(var(--theme-nuance-color-3-saturation) - 40%),
calc(var(--theme-nuance-color-3-lightness) - 5%));
--button-tertiary-color:
hsl(210,
calc(var(--theme-nuance-color-3-saturation) - 40%),
calc(var(--theme-nuance-color-3-lightness) + 20%));
--button-tertiary-border:
hsl(210,
calc(var(--theme-nuance-color-3-saturation) - 40%),
calc(var(--theme-nuance-color-3-lightness) + 20%));
/* ---------hover---------- */
--button-tertiary-text-hover:
hsl(210,
calc(var(--theme-nuance-color-3-saturation) - 40%),
calc(var(--theme-nuance-color-3-lightness) - 5%));
--button-tertiary-color-hover:
hsl(210,
calc(var(--theme-nuance-color-3-saturation) - 40%),
calc(var(--theme-nuance-color-3-lightness) + 20%));
--button-tertiary-border-hover:
hsl(210,
calc(var(--theme-nuance-color-3-saturation) - 40%),
calc(var(--theme-nuance-color-3-lightness) + 20%));
}

View File

@@ -0,0 +1,251 @@
/* Author: Yazan Agha-Schrader */
/* Inspiration from Nord Theme https://www.nordtheme.com/docs/colors-and-palettes */
.theme-snowstorm {
/* ---------- PRIMARY COLORS ----------------- */
--primary-color-1: hsl(217.5, 26.7%, 94.1%);
--primary-color-1-hue: 217.5;
--primary-color-1-saturation: 26.7%;
--primary-color-1-lightness: 94.1%;
--primary-color-2: hsl(218.2, 26.8%, 92.0%);
--primary-color-2-hue: 218.2;
--primary-color-2-saturation: 26.8%;
--primary-color-2-lightness: 92.0%;
--primary-color-3: hsl(218.8, 27.9%, 88.0%);
--primary-color-3-hue: 218.8;
--primary-color-3-saturation: 27.9%;
--primary-color-3-lightness: 88.0%;
--primary-color-4: hsl(218.8, 18.3%, 81.8%);
--primary-color-4-hue: 218.8;
--primary-color-4-saturation: 18.3%;
--primary-color-4-lightness: 81.8%;
/* ---------- SECONDARY COLORS --------------- */
--secondary-color-1: hsl(220.0, 16.4%, 21.6%);
--secondary-color-1-hue: 220.0;
--secondary-color-1-saturation: 16.4%;
--secondary-color-1-lightness: 21.6%;
--secondary-color-2: hsl(221.7, 16.3%, 27.6%);
--secondary-color-2-hue: 221.7;
--secondary-color-2-saturation: 16.3%;
--secondary-color-2-lightness: 27.6%;
--secondary-color-3: hsl(220.0, 16.8%, 31.6%);
--secondary-color-3-hue: 220.0;
--secondary-color-3-saturation: 16.8%;
--secondary-color-3-lightness: 31.6%;
--secondary-color-4: hsl(220.0, 16.5%, 35.7%);
--secondary-color-4-hue: 220.0;
--secondary-color-4-saturation: 16.5%;
--secondary-color-4-lightness: 35.7%;
/* ----------- NUANCES COLORS ---------------- */
--theme-nuance-color-1: hsl(178.7, 25.1%, 64.9%);
--theme-nuance-color-1-hue: 178.7;
--theme-nuance-color-1-saturation: 25.1%;
--theme-nuance-color-1-lightness: 64.9%;
--theme-nuance-color-2: hsl(193.3, 43.4%, 67.5%);
--theme-nuance-color-2-hue: 193.3;
--theme-nuance-color-2-saturation: 43.4%;
--theme-nuance-color-2-lightness: 67.5%;
--theme-nuance-color-3: hsl(210.0, 34.0%, 63.1%);
--theme-nuance-color-3-hue: 210.0;
--theme-nuance-color-3-saturation: 34.0%;
--theme-nuance-color-3-lightness: 63.1%;
--theme-nuance-color-4: hsl(213.1, 32.0%, 52.2%);
--theme-nuance-color-4-hue: 213.1;
--theme-nuance-color-4-saturation: 32.0%;
--theme-nuance-color-4-lightness: 52.2%;
/* ----------- ROYGP COLORS ------------------ */
--theme-red-color: hsl(32.5, 80%, 50%);
--theme-orange-color: hsl(32.5, 70%, 45%);
--theme-yellow-color: hsl(40.0, 0.6%, 73.3%);
--theme-green-color: hsl(92.4, 27.8%, 64.7%);
--theme-purple-color: hsl(311.1, 20.2%, 63.1%);
/* ------------------------------------------- */
--background-color-1: var(--primary-color-1);
--background-color-2: var(--primary-color-2);
--background-color-3: var(--primary-color-3);
--background-color-4: var(--primary-color-4);
--border-color-1: var(--primary-color-2);
--border-color-2: var(--primary-color-3);
--border-color-3: var(--primary-color-4);
--border-focus-color: var(--theme-nuance-color-2);
--border-focus-shadow: var(--theme-nuance-color-1);
--text-color-plain: var(--secondary-color-1);
--text-color-subtile-1: var(--secondary-color-2);
--text-color-subtile-2: var(--secondary-color-3);
--code-background-color: var(--secondary-color-2);
--code-text-color: var(--primary-color-2);
--ui-range-thumb-color: var(--theme-nuance-color-3);
--ui-range-thumb-border: var(--ui-ranger-thumb-color);
--textarea-border-color: var(--secondary-color-4);
--chat-id-color: var(--theme-nuance-color-4);
/* ------------------------------------------- */
--button-alert-text-hover: var(--primary-color-1);
--button-alert-color-hover: var(--theme-orange-color);
--button-alert-border-hover: var(--theme-orange-color);
--button-alert-text-active: var(--primary-color-1);
--button-alert-color-active: var(--theme-red-color);
--button-alert-border-active: var(--theme-red-color);
/* ----------- PRIMARY BUTTONS --------------- */
/* - button should immediately catch the eye - */
--button-primary-text: var(--secondary-color-1);
--button-primary-color: var(--theme-nuance-color-3);
--button-primary-border: var(--theme-nuance-color-3);
/* ---------hover---------- */
--button-primary-text-hover:
hsl(217.5,
calc(var(--secondary-color-1-saturation) + 35%),
calc(var(--secondary-color-1-lightness) - 30%));
--button-primary-color-hover:
hsl(210,
calc(var(--theme-nuance-color-3-saturation) - 2%),
calc(var(--theme-nuance-color-3-lightness) - 10%));
--button-primary-border-hover:
hsl(210,
calc(var(--theme-nuance-color-3-saturation) - 2%),
calc(var(--theme-nuance-color-3-lightness) - 10%));
/* ---------active--------- */
--button-primary-text-active:
hsl(210,
calc(var(--theme-nuance-color-3-saturation) - 20%),
calc(var(--theme-nuance-color-3-lightness) + 35%));
--button-primary-color-active:
hsl(210,
calc(var(--theme-nuance-color-3-saturation) - 10%),
calc(var(--theme-nuance-color-3-lightness) - 25%));
--button-primary-border-active:
hsl(210,
calc(var(--theme-nuance-color-3-saturation) - 10%),
calc(var(--theme-nuance-color-3-lightness) - 25%));
/* ---------- SECONDARY BUTTONS -------------- */
/* these should NOT immediately catch the eye */
--button-secondary-text:
hsl(210,
calc(var(--theme-nuance-color-3-saturation) - 20%),
calc(var(--theme-nuance-color-3-lightness) - 50%));
--button-secondary-color:
hsl(210,
calc(var(--theme-nuance-color-3-saturation) - 20%),
calc(var(--theme-nuance-color-3-lightness) + 10%));
--button-secondary-border:
hsl(210,
calc(var(--theme-nuance-color-3-saturation) - 20%),
calc(var(--theme-nuance-color-3-lightness) + 10%));
/* ---------hover---------- */
--button-secondary-text-hover:
hsl(210,
calc(var(--theme-nuance-color-3-saturation) - 20%),
calc(var(--theme-nuance-color-3-lightness) - 80%));
--button-secondary-color-hover:
hsl(210,
calc(var(--theme-nuance-color-3-saturation) - 22%),
calc(var(--theme-nuance-color-3-lightness) + 1%));
--button-secondary-border-hover:
hsl(210,
calc(var(--theme-nuance-color-3-saturation) - 22%),
calc(var(--theme-nuance-color-3-lightness) + 1%));
/* ---------active--------- */
--button-secondary-text-active:
hsl(210,
calc(var(--theme-nuance-color-3-saturation) + 40%),
calc(var(--theme-nuance-color-3-lightness) - 55%));
--button-secondary-color-active:
hsl(210,
calc(var(--theme-nuance-color-3-saturation) - 30%),
calc(var(--theme-nuance-color-3-lightness) - 5%));
--button-secondary-border-active:
hsl(210,
calc(var(--theme-nuance-color-3-saturation) - 30%),
calc(var(--theme-nuance-color-3-lightness) - 5%));
/* ---------- TERTIARY BUTTONS --------------- */
/* ---------- disabled buttons --------------- */
--button-tertiary-text:
hsl(210,
calc(var(--theme-nuance-color-3-saturation) - 40%),
calc(var(--theme-nuance-color-3-lightness) - 5%));
--button-tertiary-color:
hsl(210,
calc(var(--theme-nuance-color-3-saturation) - 40%),
calc(var(--theme-nuance-color-3-lightness) + 20%));
--button-tertiary-border:
hsl(210,
calc(var(--theme-nuance-color-3-saturation) - 40%),
calc(var(--theme-nuance-color-3-lightness) + 20%));
/* ---------hover---------- */
--button-tertiary-text-hover:
hsl(210,
calc(var(--theme-nuance-color-3-saturation) - 40%),
calc(var(--theme-nuance-color-3-lightness) - 5%));
--button-tertiary-color-hover:
hsl(210,
calc(var(--theme-nuance-color-3-saturation) - 40%),
calc(var(--theme-nuance-color-3-lightness) + 20%));
--button-tertiary-border-hover:
hsl(210,
calc(var(--theme-nuance-color-3-saturation) - 40%),
calc(var(--theme-nuance-color-3-lightness) + 20%));
}

View File

@@ -0,0 +1,266 @@
//@ts-check
// Helpers to work with different data types
// by Humans for All
//
/**
* Given the limited context size of local LLMs and , many a times when context gets filled
* between the prompt and the response, it can lead to repeating text garbage generation.
* And many a times setting penalty wrt repeatation leads to over-intelligent garbage
* repeatation with slight variations. These garbage inturn can lead to overloading of the
* available model context, leading to less valuable response for subsequent prompts/queries,
* if chat history is sent to ai model.
*
* So two simple minded garbage trimming logics are experimented below.
* * one based on progressively-larger-substring-based-repeat-matching-with-partial-skip and
* * another based on char-histogram-driven garbage trimming.
* * in future characteristic of histogram over varying lengths could be used to allow for
* a more aggressive and adaptive trimming logic.
*/
/**
* Simple minded logic to help remove repeating garbage at end of the string.
* The repeatation needs to be perfectly matching.
*
* The logic progressively goes on probing for longer and longer substring based
* repeatation, till there is no longer repeatation. Inturn picks the one with
* the longest chain.
*
* @param {string} sIn
* @param {number} maxSubL
* @param {number} maxMatchLenThreshold
*/
export function trim_repeat_garbage_at_end(sIn, maxSubL=10, maxMatchLenThreshold=40) {
let rCnt = [0];
let maxMatchLen = maxSubL;
let iMML = -1;
for(let subL=1; subL < maxSubL; subL++) {
rCnt.push(0);
let i;
let refS = sIn.substring(sIn.length-subL, sIn.length);
for(i=sIn.length; i > 0; i -= subL) {
let curS = sIn.substring(i-subL, i);
if (refS != curS) {
let curMatchLen = rCnt[subL]*subL;
if (maxMatchLen < curMatchLen) {
maxMatchLen = curMatchLen;
iMML = subL;
}
break;
}
rCnt[subL] += 1;
}
}
console.debug("DBUG:DU:TrimRepeatGarbage:", rCnt);
if ((iMML == -1) || (maxMatchLen < maxMatchLenThreshold)) {
return {trimmed: false, data: sIn};
}
console.debug("DBUG:TrimRepeatGarbage:TrimmedCharLen:", maxMatchLen);
let iEnd = sIn.length - maxMatchLen;
return { trimmed: true, data: sIn.substring(0, iEnd) };
}
/**
* Simple minded logic to help remove repeating garbage at end of the string, till it cant.
* If its not able to trim, then it will try to skip a char at end and then trim, a few times.
* This ensures that even if there are multiple runs of garbage with different patterns, the
* logic still tries to munch through them.
*
* @param {string} sIn
* @param {number} maxSubL
* @param {number | undefined} [maxMatchLenThreshold]
*/
export function trim_repeat_garbage_at_end_loop(sIn, maxSubL, maxMatchLenThreshold, skipMax=16) {
let sCur = sIn;
let sSaved = "";
let iTry = 0;
while(true) {
let got = trim_repeat_garbage_at_end(sCur, maxSubL, maxMatchLenThreshold);
if (got.trimmed != true) {
if (iTry == 0) {
sSaved = got.data;
}
iTry += 1;
if (iTry >= skipMax) {
return sSaved;
}
got.data = got.data.substring(0,got.data.length-1);
} else {
iTry = 0;
}
sCur = got.data;
}
}
/**
* A simple minded try trim garbage at end using histogram driven characteristics.
* There can be variation in the repeatations, as long as no new char props up.
*
* This tracks the chars and their frequency in a specified length of substring at the end
* and inturn checks if moving further into the generated text from the end remains within
* the same char subset or goes beyond it and based on that either trims the string at the
* end or not. This allows to filter garbage at the end, including even if there are certain
* kind of small variations in the repeated text wrt position of seen chars.
*
* Allow the garbage to contain upto maxUniq chars, but at the same time ensure that
* a given type of char ie numerals or alphabets or other types dont cross the specified
* maxType limit. This allows intermixed text garbage to be identified and trimmed.
*
* ALERT: This is not perfect and only provides a rough garbage identification logic.
* Also it currently only differentiates between character classes wrt english.
*
* @param {string} sIn
* @param {number} maxType
* @param {number} maxUniq
* @param {number} maxMatchLenThreshold
*/
export function trim_hist_garbage_at_end(sIn, maxType, maxUniq, maxMatchLenThreshold) {
if (sIn.length < maxMatchLenThreshold) {
return { trimmed: false, data: sIn };
}
let iAlp = 0;
let iNum = 0;
let iOth = 0;
// Learn
let hist = {};
let iUniq = 0;
for(let i=0; i<maxMatchLenThreshold; i++) {
let c = sIn[sIn.length-1-i];
if (c in hist) {
hist[c] += 1;
} else {
if(c.match(/[0-9]/) != null) {
iNum += 1;
} else if(c.match(/[A-Za-z]/) != null) {
iAlp += 1;
} else {
iOth += 1;
}
iUniq += 1;
if (iUniq >= maxUniq) {
break;
}
hist[c] = 1;
}
}
console.debug("DBUG:TrimHistGarbage:", hist);
if ((iAlp > maxType) || (iNum > maxType) || (iOth > maxType)) {
return { trimmed: false, data: sIn };
}
// Catch and Trim
for(let i=0; i < sIn.length; i++) {
let c = sIn[sIn.length-1-i];
if (!(c in hist)) {
if (i < maxMatchLenThreshold) {
return { trimmed: false, data: sIn };
}
console.debug("DBUG:TrimHistGarbage:TrimmedCharLen:", i);
return { trimmed: true, data: sIn.substring(0, sIn.length-i+1) };
}
}
console.debug("DBUG:TrimHistGarbage:Trimmed fully");
return { trimmed: true, data: "" };
}
/**
* Keep trimming repeatedly using hist_garbage logic, till you no longer can.
* This ensures that even if there are multiple runs of garbage with different patterns,
* the logic still tries to munch through them.
*
* @param {any} sIn
* @param {number} maxType
* @param {number} maxUniq
* @param {number} maxMatchLenThreshold
*/
export function trim_hist_garbage_at_end_loop(sIn, maxType, maxUniq, maxMatchLenThreshold) {
let sCur = sIn;
while (true) {
let got = trim_hist_garbage_at_end(sCur, maxType, maxUniq, maxMatchLenThreshold);
if (!got.trimmed) {
return got.data;
}
sCur = got.data;
}
}
/**
* Try trim garbage at the end by using both the hist-driven-garbage-trimming as well as
* skip-a-bit-if-reqd-then-repeat-pattern-based-garbage-trimming, with blind retrying.
* @param {string} sIn
*/
export function trim_garbage_at_end(sIn) {
let sCur = sIn;
for(let i=0; i<2; i++) {
sCur = trim_hist_garbage_at_end_loop(sCur, 8, 24, 72);
sCur = trim_repeat_garbage_at_end_loop(sCur, 32, 72, 12);
}
return sCur;
}
/**
* NewLines array helper.
* Allow for maintaining a list of lines.
* Allow for a line to be builtup/appended part by part.
*/
export class NewLines {
constructor() {
/** @type {string[]} */
this.lines = [];
}
/**
* Extracts lines from the passed string and inturn either
* append to a previous partial line or add a new line.
* @param {string} sLines
*/
add_append(sLines) {
let aLines = sLines.split("\n");
let lCnt = 0;
for(let line of aLines) {
lCnt += 1;
// Add back newline removed if any during split
if (lCnt < aLines.length) {
line += "\n";
} else {
if (sLines.endsWith("\n")) {
line += "\n";
}
}
// Append if required
if (lCnt == 1) {
let lastLine = this.lines[this.lines.length-1];
if (lastLine != undefined) {
if (!lastLine.endsWith("\n")) {
this.lines[this.lines.length-1] += line;
continue;
}
}
}
// Add new line
this.lines.push(line);
}
}
/**
* Shift the oldest/earliest/0th line in the array. [Old-New|Earliest-Latest]
* Optionally control whether only full lines (ie those with newline at end) will be returned
* or will a partial line without a newline at end (can only be the last line) be returned.
* @param {boolean} bFullWithNewLineOnly
*/
shift(bFullWithNewLineOnly=true) {
let line = this.lines[0];
if (line == undefined) {
return undefined;
}
if ((line[line.length-1] != "\n") && bFullWithNewLineOnly){
return undefined;
}
return this.lines.shift();
}
}

View File

@@ -8,21 +8,23 @@
<meta name="description" content="SimpleChat: trigger LLM web service endpoints /chat/completions and /completions, single/multi chat sessions" />
<meta name="author" content="by Humans for All" />
<meta http-equiv="Cache-Control" content="no-cache, no-store, must-revalidate" />
<script src="simplechat.js" defer></script>
<script type="importmap">
{
"imports": {
"datautils": "./datautils.mjs",
"ui": "./ui.mjs"
}
}
</script>
<script src="simplechat.js" type="module" defer></script>
<link rel="stylesheet" href="simplechat.css" />
</head>
<body>
<div class="samecolumn" id="fullbody">
<div class="sameline">
<div class="sameline" id="heading">
<p class="heading flex-grow" > <b> SimpleChat </b> </p>
<div class="sameline">
<label for="api-ep">Mode:</label>
<select name="api-ep" id="api-ep">
<option value="chat" selected>Chat</option>
<option value="completion">Completion</option>
</select>
</div>
<button id="settings">Settings</button>
</div>
<div id="sessions-div" class="sameline"></div>
@@ -30,7 +32,7 @@
<hr>
<div class="sameline">
<label for="system-in">System</label>
<input type="text" name="system" id="system-in" placeholder="e.g. you are a helpful ai assistant, who provides concise answers" class="flex-grow"/>
<textarea name="system" id="system-in" rows="2" placeholder="e.g. you are a helpful ai assistant, who provides concise answers" class="flex-grow"></textarea>
</div>
<hr>
@@ -40,7 +42,7 @@
<hr>
<div class="sameline">
<textarea id="user-in" class="flex-grow" rows="3" placeholder="enter your query to the ai model here" ></textarea>
<textarea id="user-in" class="flex-grow" rows="2" placeholder="enter your query to the ai model here" ></textarea>
<button id="user-btn">submit</button>
</div>

View File

@@ -11,18 +11,29 @@ in a simple way with minimal code from a common code base. Inturn additionally i
multiple independent back and forth chatting to an extent, with the ai llm model at a basic level, with their
own system prompts.
This allows seeing the generated text / ai-model response in oneshot at the end, after it is fully generated,
or potentially as it is being generated, in a streamed manner from the server/ai-model.
Auto saves the chat session locally as and when the chat is progressing and inturn at a later time when you
open SimpleChat, option is provided to restore the old chat session, if a matching one exists.
The UI follows a responsive web design so that the layout can adapt to available display space in a usable
enough manner, in general.
Allows developer/end-user to control some of the behaviour by updating gMe members from browser's devel-tool
console.
console. Parallely some of the directly useful to end-user settings can also be changed using the provided
settings ui.
NOTE: Given that the idea is for basic minimal testing, it doesnt bother with any model context length and
culling of old messages from the chat by default. However by enabling the sliding window chat logic, a crude
form of old messages culling can be achieved.
NOTE: Current web service api doesnt expose the model context length directly, so client logic doesnt provide
any adaptive culling of old messages nor of replacing them with summary of their content etal. However there
is a optional sliding window based chat logic, which provides a simple minded culling of old messages from
the chat history before sending to the ai model.
NOTE: It doesnt set any parameters other than temperature and max_tokens for now. However if someone wants
they can update the js file or equivalent member in gMe as needed.
NOTE: Wrt options sent with the request, it mainly sets temperature, max_tokens and optionaly stream for now.
However if someone wants they can update the js file or equivalent member in gMe as needed.
NOTE: One may be able to use this to chat with openai api web-service /chat/completions endpoint, in a very
limited / minimal way. One will need to set model, openai url and authorization bearer key in settings ui.
## usage
@@ -52,9 +63,15 @@ Open this simple web front end from your local browser
Once inside
* Select between chat and completion mode. By default it is set to chat mode.
* If you want to, you can change many of the default global settings
* the base url (ie ip addr / domain name, port)
* chat (default) vs completion mode
* try trim garbage in response or not
* amount of chat history in the context sent to server/ai-model
* oneshot or streamed mode.
* In completion mode
* one normally doesnt use a system prompt in completion mode.
* logic by default doesnt insert any role specific "ROLE: " prefix wrt each role's message.
If the model requires any prefix wrt user role messages, then the end user has to
explicitly add the needed prefix, when they enter their chat message.
@@ -88,12 +105,16 @@ Once inside
* Wait for the logic to communicate with the server and get the response.
* the user is not allowed to enter any fresh query during this time.
* the user input box will be disabled and a working message will be shown in it.
* if trim garbage is enabled, the logic will try to trim repeating text kind of garbage to some extent.
* just refresh the page, to reset wrt the chat history and or system prompt and start afresh.
* Using NewChat one can start independent chat sessions.
* two independent chat sessions are setup by default.
* When you want to print, switching ChatHistoryInCtxt to Full and clicking on the chat session button of
interest, will display the full chat history till then wrt same, if you want full history for printing.
## Devel note
@@ -104,14 +125,31 @@ by developers who may not be from web frontend background (so inturn may not be
end-use-specific-language-extensions driven flows) so that they can use it to explore/experiment things.
And given that the idea is also to help explore/experiment for developers, some flexibility is provided
to change behaviour easily using the devel-tools/console, for now. And skeletal logic has been implemented
to explore some of the end points and ideas/implications around them.
to change behaviour easily using the devel-tools/console or provided minimal settings ui (wrt few aspects).
Skeletal logic has been implemented to explore some of the end points and ideas/implications around them.
### General
Me/gMe consolidates the settings which control the behaviour into one object.
One can see the current settings, as well as change/update them using browsers devel-tool/console.
It is attached to the document object. Some of these can also be updated using the Settings UI.
baseURL - the domain-name/ip-address and inturn the port to send the request.
bStream - control between oneshot-at-end and live-stream-as-its-generated collating and showing
of the generated response.
the logic assumes that the text sent from the server follows utf-8 encoding.
in streaming mode - if there is any exception, the logic traps the same and tries to ensure
that text generated till then is not lost.
if a very long text is being generated, which leads to no user interaction for sometime and
inturn the machine goes into power saving mode or so, the platform may stop network connection,
leading to exception.
apiEP - select between /completions and /chat/completions endpoint provided by the server/ai-model.
bCompletionFreshChatAlways - whether Completion mode collates complete/sliding-window history when
communicating with the server or only sends the latest user query/message.
@@ -119,6 +157,19 @@ One can see the current settings, as well as change/update them using browsers d
bCompletionInsertStandardRolePrefix - whether Completion mode inserts role related prefix wrt the
messages that get inserted into prompt field wrt /Completion endpoint.
bTrimGarbage - whether garbage repeatation at the end of the generated ai response, should be
trimmed or left as is. If enabled, it will be trimmed so that it wont be sent back as part of
subsequent chat history. At the same time the actual trimmed text is shown to the user, once
when it was generated, so user can check if any useful info/data was there in the response.
One may be able to request the ai-model to continue (wrt the last response) (if chat-history
is enabled as part of the chat-history-in-context setting), and chances are the ai-model will
continue starting from the trimmed part, thus allows long response to be recovered/continued
indirectly, in many cases.
The histogram/freq based trimming logic is currently tuned for english language wrt its
is-it-a-alpabetic|numeral-char regex match logic.
chatRequestOptions - maintains the list of options/fields to send along with chat request,
irrespective of whether /chat/completions or /completions endpoint.
@@ -126,6 +177,14 @@ One can see the current settings, as well as change/update them using browsers d
modify the existing options value or remove them, for now you can update this global var
using browser's development-tools/console.
For string and numeric fields in chatRequestOptions, including even those added by a user
at runtime by directly modifying gMe.chatRequestOptions, setting ui entries will be auto
created.
headers - maintains the list of http headers sent when request is made to the server. By default
Content-Type is set to application/json. Additionally Authorization entry is provided, which can
be set if needed using the settings ui.
iRecentUserMsgCnt - a simple minded SlidingWindow to limit context window load at Ai Model end.
This is disabled by default. However if enabled, then in addition to latest system message, only
the last/latest iRecentUserMsgCnt user messages after the latest system prompt and its responses
@@ -140,7 +199,8 @@ One can see the current settings, as well as change/update them using browsers d
By using gMe's iRecentUserMsgCnt and chatRequestOptions.max_tokens one can try to control the
implications of loading of the ai-model's context window by chat history, wrt chat response to
some extent in a simple crude way.
some extent in a simple crude way. You may also want to control the context size enabled when
the server loads ai-model, on the server end.
Sometimes the browser may be stuborn with caching of the file, so your updates to html/css/js
@@ -149,28 +209,15 @@ matter clearing site data, dont directly override site caching in all cases. Wor
have to change port. Or in dev tools of browser, you may be able to disable caching fully.
Concept of multiple chat sessions with different servers, as well as saving and restoring of
those across browser usage sessions, can be woven around the SimpleChat/MultiChatUI class and
its instances relatively easily, however given the current goal of keeping this simple, it has
not been added, for now.
Currently the server to communicate with is maintained globally and not as part of a specific
chat session. So if one changes the server ip/url in setting, then all chat sessions will auto
switch to this new server, when you try using those sessions.
By switching between chat.add_system_begin/anytime, one can control whether one can change
the system prompt, anytime during the conversation or only at the beginning.
read_json_early, is to experiment with reading json response data early on, if available,
so that user can be shown generated data, as and when it is being generated, rather than
at the end when full data is available.
the server flow doesnt seem to be sending back data early, atleast for request (inc options)
that is currently sent.
if able to read json data early on in future, as and when ai model is generating data, then
this helper needs to indirectly update the chat div with the recieved data, without waiting
for the overall data to be available.
### Default setup
By default things are setup to try and make the user experience a bit better, if possible.
@@ -179,7 +226,8 @@ However a developer when testing the server of ai-model may want to change these
Using iRecentUserMsgCnt reduce chat history context sent to the server/ai-model to be
just the system-prompt, prev-user-request-and-ai-response and cur-user-request, instead of
full chat history. This way if there is any response with garbage/repeatation, it doesnt
mess with things beyond the next question/request/query, in some ways.
mess with things beyond the next question/request/query, in some ways. The trim garbage
option also tries to help avoid issues with garbage in the context to an extent.
Set max_tokens to 1024, so that a relatively large previous reponse doesnt eat up the space
available wrt next query-response. However dont forget that the server when started should
@@ -189,11 +237,33 @@ also be started with a model context size of 1k or more, to be on safe side.
internal n_predict, for now add the same here on the client side, maybe later add max_tokens
to /completions endpoint handling code on server side.
Frequency and presence penalty fields are set to 1.2 in the set of fields sent to server
along with the user query. So that the model is partly set to try avoid repeating text in
its response.
NOTE: One may want to experiment with frequency/presence penalty fields in chatRequestOptions
wrt the set of fields sent to server along with the user query. To check how the model behaves
wrt repeatations in general in the generated text response.
A end-user can change these behaviour by editing gMe from browser's devel-tool/console.
A end-user can change these behaviour by editing gMe from browser's devel-tool/console or by
using the providing settings ui.
### OpenAi / Equivalent API WebService
One may be abe to handshake with OpenAI/Equivalent api web service's /chat/completions endpoint
for a minimal chatting experimentation by setting the below.
* the baseUrl in settings ui
* https://api.openai.com/v1 or similar
* Wrt request body - gMe.chatRequestOptions
* model (settings ui)
* any additional fields if required in future
* Wrt request headers - gMe.headers
* Authorization (available through settings ui)
* Bearer THE_OPENAI_API_KEY
* any additional optional header entries like "OpenAI-Organization", "OpenAI-Project" or so
NOTE: Not tested, as there is no free tier api testing available. However logically this might
work.
## At the end

View File

@@ -21,6 +21,17 @@
.role-user {
background-color: lightgray;
}
.role-trim {
background-color: lightpink;
}
.gridx2 {
display: grid;
grid-template-columns: repeat(2, 1fr);
border-bottom-style: dotted;
border-bottom-width: thin;
border-bottom-color: lightblue;
}
.flex-grow {
flex-grow: 1;

View File

@@ -2,6 +2,9 @@
// A simple completions and chat/completions test related web front end logic
// by Humans for All
import * as du from "./datautils.mjs";
import * as ui from "./ui.mjs"
class Roles {
static System = "system";
static User = "user";
@@ -9,40 +12,65 @@ class Roles {
}
class ApiEP {
static Chat = "chat";
static Completion = "completion";
static Type = {
Chat: "chat",
Completion: "completion",
}
static UrlSuffix = {
'chat': `/chat/completions`,
'completion': `/completions`,
}
/**
* Build the url from given baseUrl and apiEp id.
* @param {string} baseUrl
* @param {string} apiEP
*/
static Url(baseUrl, apiEP) {
if (baseUrl.endsWith("/")) {
baseUrl = baseUrl.substring(0, baseUrl.length-1);
}
return `${baseUrl}${this.UrlSuffix[apiEP]}`;
}
}
let gUsageMsg = `
<p class="role-system">Usage</p>
<ul class="ul1">
<li> Set system prompt above, to try control ai response charactersitic, if model supports same.</li>
<li> System prompt above, to try control ai response characteristics.</li>
<ul class="ul2">
<li> Completion mode normally wont have a system prompt.</li>
<li> Completion mode - no system prompt normally.</li>
</ul>
<li> Use shift+enter for inserting enter/newline.</li>
<li> Enter your query to ai assistant below.</li>
<ul class="ul2">
<li> Completion mode doesnt insert user/role: prefix implicitly.</li>
<li> Use shift+enter for inserting enter/newline.</li>
</ul>
<li> Default ContextWindow = [System, Last Query+Resp, Cur Query].</li>
<ul class="ul2">
<li> experiment iRecentUserMsgCnt, max_tokens, model ctxt window to expand</li>
<li> ChatHistInCtxt, MaxTokens, ModelCtxt window to expand</li>
</ul>
</ul>
`;
/** @typedef {{role: string, content: string}[]} ChatMessages */
/** @typedef {{iLastSys: number, xchat: ChatMessages}} SimpleChatODS */
class SimpleChat {
constructor() {
/**
* @param {string} chatId
*/
constructor(chatId) {
this.chatId = chatId;
/**
* Maintain in a form suitable for common LLM web service chat/completions' messages entry
* @type {ChatMessages}
*/
this.xchat = [];
this.iLastSys = -1;
this.latestResponse = "";
}
clear() {
@@ -50,6 +78,27 @@ class SimpleChat {
this.iLastSys = -1;
}
ods_key() {
return `SimpleChat-${this.chatId}`
}
save() {
/** @type {SimpleChatODS} */
let ods = {iLastSys: this.iLastSys, xchat: this.xchat};
localStorage.setItem(this.ods_key(), JSON.stringify(ods));
}
load() {
let sods = localStorage.getItem(this.ods_key());
if (sods == null) {
return;
}
/** @type {SimpleChatODS} */
let ods = JSON.parse(sods);
this.iLastSys = ods.iLastSys;
this.xchat = ods.xchat;
}
/**
* Recent chat messages.
* If iRecentUserMsgCnt < 0
@@ -94,6 +143,15 @@ class SimpleChat {
return rchat;
}
/**
* Collate the latest response from the server/ai-model, as it is becoming available.
* This is mainly useful for the stream mode.
* @param {string} content
*/
append_response(content) {
this.latestResponse += content;
}
/**
* Add an entry into xchat
* @param {string} role
@@ -107,6 +165,7 @@ class SimpleChat {
if (role == Roles.System) {
this.iLastSys = this.xchat.length - 1;
}
this.save();
return true;
}
@@ -121,10 +180,8 @@ class SimpleChat {
}
let last = undefined;
for(const x of this.recent_chat(gMe.iRecentUserMsgCnt)) {
let entry = document.createElement("p");
let entry = ui.el_create_append_p(`${x.role}: ${x.content}`, div);
entry.className = `role-${x.role}`;
entry.innerText = `${x.role}: ${x.content}`;
div.appendChild(entry);
last = entry;
}
if (last !== undefined) {
@@ -132,21 +189,45 @@ class SimpleChat {
} else {
if (bClear) {
div.innerHTML = gUsageMsg;
gMe.setup_load(div, this);
gMe.show_info(div);
}
}
return last;
}
/**
* Setup the fetch headers.
* It picks the headers from gMe.headers.
* It inserts Authorization only if its non-empty.
* @param {string} apiEP
*/
fetch_headers(apiEP) {
let headers = new Headers();
for(let k in gMe.headers) {
let v = gMe.headers[k];
if ((k == "Authorization") && (v.trim() == "")) {
continue;
}
headers.append(k, v);
}
return headers;
}
/**
* Add needed fields wrt json object to be sent wrt LLM web services completions endpoint.
* The needed fields/options are picked from a global object.
* Add optional stream flag, if required.
* Convert the json into string.
* @param {Object} obj
*/
request_jsonstr(obj) {
request_jsonstr_extend(obj) {
for(let k in gMe.chatRequestOptions) {
obj[k] = gMe.chatRequestOptions[k];
}
if (gMe.bStream) {
obj["stream"] = true;
}
return JSON.stringify(obj);
}
@@ -157,7 +238,7 @@ class SimpleChat {
let req = {
messages: this.recent_chat(gMe.iRecentUserMsgCnt),
}
return this.request_jsonstr(req);
return this.request_jsonstr_extend(req);
}
/**
@@ -180,7 +261,60 @@ class SimpleChat {
let req = {
prompt: prompt,
}
return this.request_jsonstr(req);
return this.request_jsonstr_extend(req);
}
/**
* Return a string form of json object suitable for specified api endpoint.
* @param {string} apiEP
*/
request_jsonstr(apiEP) {
if (apiEP == ApiEP.Type.Chat) {
return this.request_messages_jsonstr();
} else {
return this.request_prompt_jsonstr(gMe.bCompletionInsertStandardRolePrefix);
}
}
/**
* Extract the ai-model/assistant's response from the http response got.
* Optionally trim the message wrt any garbage at the end.
* @param {any} respBody
* @param {string} apiEP
*/
response_extract(respBody, apiEP) {
let assistant = "";
if (apiEP == ApiEP.Type.Chat) {
assistant = respBody["choices"][0]["message"]["content"];
} else {
try {
assistant = respBody["choices"][0]["text"];
} catch {
assistant = respBody["content"];
}
}
return assistant;
}
/**
* Extract the ai-model/assistant's response from the http response got in streaming mode.
* @param {any} respBody
* @param {string} apiEP
*/
response_extract_stream(respBody, apiEP) {
let assistant = "";
if (apiEP == ApiEP.Type.Chat) {
if (respBody["choices"][0]["finish_reason"] !== "stop") {
assistant = respBody["choices"][0]["delta"]["content"];
}
} else {
try {
assistant = respBody["choices"][0]["text"];
} catch {
assistant = respBody["content"];
}
}
return assistant;
}
/**
@@ -239,53 +373,99 @@ class SimpleChat {
return sysPrompt;
}
}
let gBaseURL = "http://127.0.0.1:8080";
let gChatURL = {
'chat': `${gBaseURL}/chat/completions`,
'completion': `${gBaseURL}/completions`,
}
/**
* Set the class of the children, based on whether it is the idSelected or not.
* @param {HTMLDivElement} elBase
* @param {string} idSelected
* @param {string} classSelected
* @param {string} classUnSelected
*/
function el_children_config_class(elBase, idSelected, classSelected, classUnSelected="") {
for(let child of elBase.children) {
if (child.id == idSelected) {
child.className = classSelected;
} else {
child.className = classUnSelected;
/**
* Handle the multipart response from server/ai-model
* @param {Response} resp
* @param {string} apiEP
* @param {HTMLDivElement} elDiv
*/
async handle_response_multipart(resp, apiEP, elDiv) {
let elP = ui.el_create_append_p("", elDiv);
if (!resp.body) {
throw Error("ERRR:SimpleChat:SC:HandleResponseMultiPart:No body...");
}
let tdUtf8 = new TextDecoder("utf-8");
let rr = resp.body.getReader();
this.latestResponse = "";
let xLines = new du.NewLines();
while(true) {
let { value: cur, done: done } = await rr.read();
if (cur) {
let curBody = tdUtf8.decode(cur, {stream: true});
console.debug("DBUG:SC:PART:Str:", curBody);
xLines.add_append(curBody);
}
while(true) {
let curLine = xLines.shift(!done);
if (curLine == undefined) {
break;
}
if (curLine.trim() == "") {
continue;
}
if (curLine.startsWith("data:")) {
curLine = curLine.substring(5);
}
let curJson = JSON.parse(curLine);
console.debug("DBUG:SC:PART:Json:", curJson);
this.append_response(this.response_extract_stream(curJson, apiEP));
}
elP.innerText = this.latestResponse;
elP.scrollIntoView(false);
if (done) {
break;
}
}
console.debug("DBUG:SC:PART:Full:", this.latestResponse);
return this.latestResponse;
}
}
/**
* Create button and set it up.
* @param {string} id
* @param {(this: HTMLButtonElement, ev: MouseEvent) => any} callback
* @param {string | undefined} name
* @param {string | undefined} innerText
*/
function el_create_button(id, callback, name=undefined, innerText=undefined) {
if (!name) {
name = id;
/**
* Handle the oneshot response from server/ai-model
* @param {Response} resp
* @param {string} apiEP
*/
async handle_response_oneshot(resp, apiEP) {
let respBody = await resp.json();
console.debug(`DBUG:SimpleChat:SC:${this.chatId}:HandleUserSubmit:RespBody:${JSON.stringify(respBody)}`);
return this.response_extract(respBody, apiEP);
}
if (!innerText) {
innerText = id;
/**
* Handle the response from the server be it in oneshot or multipart/stream mode.
* Also take care of the optional garbage trimming.
* @param {Response} resp
* @param {string} apiEP
* @param {HTMLDivElement} elDiv
*/
async handle_response(resp, apiEP, elDiv) {
let theResp = {
assistant: "",
trimmed: "",
}
if (gMe.bStream) {
try {
theResp.assistant = await this.handle_response_multipart(resp, apiEP, elDiv);
this.latestResponse = "";
} catch (error) {
theResp.assistant = this.latestResponse;
this.add(Roles.Assistant, theResp.assistant);
this.latestResponse = "";
throw error;
}
} else {
theResp.assistant = await this.handle_response_oneshot(resp, apiEP);
}
if (gMe.bTrimGarbage) {
let origMsg = theResp.assistant;
theResp.assistant = du.trim_garbage_at_end(origMsg);
theResp.trimmed = origMsg.substring(theResp.assistant.length);
}
this.add(Roles.Assistant, theResp.assistant);
return theResp;
}
let btn = document.createElement("button");
btn.id = id;
btn.name = name;
btn.innerText = innerText;
btn.addEventListener("click", callback);
return btn;
}
@@ -302,14 +482,16 @@ class MultiChatUI {
this.elDivChat = /** @type{HTMLDivElement} */(document.getElementById("chat-div"));
this.elBtnUser = /** @type{HTMLButtonElement} */(document.getElementById("user-btn"));
this.elInUser = /** @type{HTMLInputElement} */(document.getElementById("user-in"));
this.elSelectApiEP = /** @type{HTMLSelectElement} */(document.getElementById("api-ep"));
this.elDivHeading = /** @type{HTMLSelectElement} */(document.getElementById("heading"));
this.elDivSessions = /** @type{HTMLDivElement} */(document.getElementById("sessions-div"));
this.elBtnSettings = /** @type{HTMLButtonElement} */(document.getElementById("settings"));
this.validate_element(this.elInSystem, "system-in");
this.validate_element(this.elDivChat, "chat-div");
this.validate_element(this.elInUser, "user-in");
this.validate_element(this.elSelectApiEP, "api-ep");
this.validate_element(this.elDivHeading, "heading");
this.validate_element(this.elDivChat, "sessions-div");
this.validate_element(this.elBtnSettings, "settings");
}
/**
@@ -350,13 +532,18 @@ class MultiChatUI {
this.handle_session_switch(this.curChatId);
}
this.elBtnSettings.addEventListener("click", (ev)=>{
this.elDivChat.replaceChildren();
gMe.show_settings(this.elDivChat);
});
this.elBtnUser.addEventListener("click", (ev)=>{
if (this.elInUser.disabled) {
return;
}
this.handle_user_submit(this.curChatId, this.elSelectApiEP.value).catch((/** @type{Error} */reason)=>{
this.handle_user_submit(this.curChatId, gMe.apiEP).catch((/** @type{Error} */reason)=>{
let msg = `ERRR:SimpleChat\nMCUI:HandleUserSubmit:${this.curChatId}\n${reason.name}:${reason.message}`;
console.debug(msg.replace("\n", ":"));
console.error(msg.replace("\n", ":"));
alert(msg);
this.ui_reset_userinput();
});
@@ -377,6 +564,8 @@ class MultiChatUI {
// allow user to insert enter into the system prompt using shift+enter.
// while just pressing enter key will lead to setting the system prompt.
if ((ev.key === "Enter") && (!ev.shiftKey)) {
let value = this.elInSystem.value;
this.elInSystem.value = value.substring(0,value.length-1);
let chat = this.simpleChats[this.curChatId];
chat.add_system_anytime(this.elInSystem.value, this.curChatId);
chat.show(this.elDivChat);
@@ -392,34 +581,12 @@ class MultiChatUI {
* @param {boolean} bSwitchSession
*/
new_chat_session(chatId, bSwitchSession=false) {
this.simpleChats[chatId] = new SimpleChat();
this.simpleChats[chatId] = new SimpleChat(chatId);
if (bSwitchSession) {
this.handle_session_switch(chatId);
}
}
/**
* Try read json response early, if available.
* @param {Response} resp
*/
async read_json_early(resp) {
if (!resp.body) {
throw Error("ERRR:SimpleChat:MCUI:ReadJsonEarly:No body...");
}
let tdUtf8 = new TextDecoder("utf-8");
let rr = resp.body.getReader();
let gotBody = "";
while(true) {
let { value: cur, done: done} = await rr.read();
let curBody = tdUtf8.decode(cur);
console.debug("DBUG:SC:PART:", curBody);
gotBody += curBody;
if (done) {
break;
}
}
return JSON.parse(gotBody);
}
/**
* Handle user query submit request, wrt specified chat session.
@@ -434,7 +601,7 @@ class MultiChatUI {
// So if user wants to simulate a multi-chat based completion query,
// they will have to enter the full thing, as a suitable multiline
// user input/query.
if ((apiEP == ApiEP.Completion) && (gMe.bCompletionFreshChatAlways)) {
if ((apiEP == ApiEP.Type.Completion) && (gMe.bCompletionFreshChatAlways)) {
chat.clear();
}
@@ -447,41 +614,26 @@ class MultiChatUI {
}
chat.show(this.elDivChat);
let theBody;
let theUrl = gChatURL[apiEP]
if (apiEP == ApiEP.Chat) {
theBody = chat.request_messages_jsonstr();
} else {
theBody = chat.request_prompt_jsonstr(gMe.bCompletionInsertStandardRolePrefix);
}
let theUrl = ApiEP.Url(gMe.baseURL, apiEP);
let theBody = chat.request_jsonstr(apiEP);
this.elInUser.value = "working...";
this.elInUser.disabled = true;
console.debug(`DBUG:SimpleChat:MCUI:${chatId}:HandleUserSubmit:${theUrl}:ReqBody:${theBody}`);
let theHeaders = chat.fetch_headers(apiEP);
let resp = await fetch(theUrl, {
method: "POST",
headers: {
"Content-Type": "application/json",
},
headers: theHeaders,
body: theBody,
});
let respBody = await resp.json();
//let respBody = await this.read_json_early(resp);
console.debug(`DBUG:SimpleChat:MCUI:${chatId}:HandleUserSubmit:RespBody:${JSON.stringify(respBody)}`);
let assistantMsg;
if (apiEP == ApiEP.Chat) {
assistantMsg = respBody["choices"][0]["message"]["content"];
} else {
try {
assistantMsg = respBody["choices"][0]["text"];
} catch {
assistantMsg = respBody["content"];
}
}
chat.add(Roles.Assistant, assistantMsg);
let theResp = await chat.handle_response(resp, apiEP, this.elDivChat);
if (chatId == this.curChatId) {
chat.show(this.elDivChat);
if (theResp.trimmed.length > 0) {
let p = ui.el_create_append_p(`TRIMMED:${theResp.trimmed}`, this.elDivChat);
p.className="role-trim";
}
} else {
console.debug(`DBUG:SimpleChat:MCUI:HandleUserSubmit:ChatId has changed:[${chatId}] [${this.curChatId}]`);
}
@@ -500,7 +652,7 @@ class MultiChatUI {
}
elDiv.replaceChildren();
// Btn for creating new chat session
let btnNew = el_create_button("New CHAT", (ev)=> {
let btnNew = ui.el_create_button("New CHAT", (ev)=> {
if (this.elInUser.disabled) {
console.error(`ERRR:SimpleChat:MCUI:NewChat:Current session [${this.curChatId}] awaiting response, ignoring request...`);
alert("ERRR:SimpleChat\nMCUI:NewChat\nWait for response to pending query, before starting new chat session");
@@ -514,7 +666,7 @@ class MultiChatUI {
}
this.new_chat_session(chatIdGot, true);
this.create_session_btn(elDiv, chatIdGot);
el_children_config_class(elDiv, chatIdGot, "session-selected", "");
ui.el_children_config_class(elDiv, chatIdGot, "session-selected", "");
});
elDiv.appendChild(btnNew);
// Btns for existing chat sessions
@@ -528,7 +680,7 @@ class MultiChatUI {
}
create_session_btn(elDiv, cid) {
let btn = el_create_button(cid, (ev)=>{
let btn = ui.el_create_button(cid, (ev)=>{
let target = /** @type{HTMLButtonElement} */(ev.target);
console.debug(`DBUG:SimpleChat:MCUI:SessionClick:${target.id}`);
if (this.elInUser.disabled) {
@@ -537,7 +689,7 @@ class MultiChatUI {
return;
}
this.handle_session_switch(target.id);
el_children_config_class(elDiv, target.id, "session-selected", "");
ui.el_children_config_class(elDiv, target.id, "session-selected", "");
});
elDiv.appendChild(btn);
return btn;
@@ -567,46 +719,183 @@ class MultiChatUI {
class Me {
constructor() {
this.baseURL = "http://127.0.0.1:8080";
this.defaultChatIds = [ "Default", "Other" ];
this.multiChat = new MultiChatUI();
this.bStream = true;
this.bCompletionFreshChatAlways = true;
this.bCompletionInsertStandardRolePrefix = false;
this.bTrimGarbage = true;
this.iRecentUserMsgCnt = 2;
this.sRecentUserMsgCnt = {
"Full": -1,
"Last0": 1,
"Last1": 2,
"Last2": 3,
"Last4": 5,
};
this.apiEP = ApiEP.Type.Chat;
this.headers = {
"Content-Type": "application/json",
"Authorization": "", // Authorization: Bearer OPENAI_API_KEY
}
// Add needed fields wrt json object to be sent wrt LLM web services completions endpoint.
this.chatRequestOptions = {
"model": "gpt-3.5-turbo",
"temperature": 0.7,
"max_tokens": 1024,
"frequency_penalty": 1.2,
"presence_penalty": 1.2,
"n_predict": 1024
"n_predict": 1024,
//"frequency_penalty": 1.2,
//"presence_penalty": 1.2,
};
}
/**
* Disable console.debug by mapping it to a empty function.
*/
debug_disable() {
this.console_debug = console.debug;
console.debug = () => {
};
}
/**
* Setup the load saved chat ui.
* @param {HTMLDivElement} div
* @param {SimpleChat} chat
*/
setup_load(div, chat) {
if (!(chat.ods_key() in localStorage)) {
return;
}
div.innerHTML += `<p class="role-system">Restore</p>
<p>Load previously saved chat session, if available</p>`;
let btn = ui.el_create_button(chat.ods_key(), (ev)=>{
console.log("DBUG:SimpleChat:SC:Load", chat);
chat.load();
queueMicrotask(()=>{
chat.show(div);
this.multiChat.elInSystem.value = chat.get_system_latest();
});
});
div.appendChild(btn);
}
/**
* Show the configurable parameters info in the passed Div element.
* @param {HTMLDivElement} elDiv
* @param {boolean} bAll
*/
show_info(elDiv, bAll=false) {
let p = ui.el_create_append_p("Settings (devel-tools-console document[gMe])", elDiv);
p.className = "role-system";
if (bAll) {
ui.el_create_append_p(`baseURL:${this.baseURL}`, elDiv);
ui.el_create_append_p(`Authorization:${this.headers["Authorization"]}`, elDiv);
ui.el_create_append_p(`bStream:${this.bStream}`, elDiv);
ui.el_create_append_p(`bCompletionFreshChatAlways:${this.bCompletionFreshChatAlways}`, elDiv);
ui.el_create_append_p(`bCompletionInsertStandardRolePrefix:${this.bCompletionInsertStandardRolePrefix}`, elDiv);
ui.el_create_append_p(`bTrimGarbage:${this.bTrimGarbage}`, elDiv);
ui.el_create_append_p(`iRecentUserMsgCnt:${this.iRecentUserMsgCnt}`, elDiv);
ui.el_create_append_p(`ApiEndPoint:${this.apiEP}`, elDiv);
}
ui.el_create_append_p(`chatRequestOptions:${JSON.stringify(this.chatRequestOptions, null, " - ")}`, elDiv);
ui.el_create_append_p(`headers:${JSON.stringify(this.headers, null, " - ")}`, elDiv);
}
/**
* Auto create ui input elements for fields in ChatRequestOptions
* Currently supports text and number field types.
* @param {HTMLDivElement} elDiv
*/
show_info(elDiv) {
show_settings_chatrequestoptions(elDiv) {
let typeDict = {
"string": "text",
"number": "number",
};
let fs = document.createElement("fieldset");
let legend = document.createElement("legend");
legend.innerText = "ChatRequestOptions";
fs.appendChild(legend);
elDiv.appendChild(fs);
for(const k in this.chatRequestOptions) {
let val = this.chatRequestOptions[k];
let type = typeof(val);
if (!((type == "string") || (type == "number"))) {
continue;
}
let inp = ui.el_creatediv_input(`Set${k}`, k, typeDict[type], this.chatRequestOptions[k], (val)=>{
if (type == "number") {
val = Number(val);
}
this.chatRequestOptions[k] = val;
});
fs.appendChild(inp.div);
}
}
var p = document.createElement("p");
p.innerText = "Settings (devel-tools-console gMe)";
p.className = "role-system";
elDiv.appendChild(p);
/**
* Show settings ui for configurable parameters, in the passed Div element.
* @param {HTMLDivElement} elDiv
*/
show_settings(elDiv) {
var p = document.createElement("p");
p.innerText = `bCompletionFreshChatAlways:${this.bCompletionFreshChatAlways}`;
elDiv.appendChild(p);
let inp = ui.el_creatediv_input("SetBaseURL", "BaseURL", "text", this.baseURL, (val)=>{
this.baseURL = val;
});
elDiv.appendChild(inp.div);
p = document.createElement("p");
p.innerText = `bCompletionInsertStandardRolePrefix:${this.bCompletionInsertStandardRolePrefix}`;
elDiv.appendChild(p);
inp = ui.el_creatediv_input("SetAuthorization", "Authorization", "text", this.headers["Authorization"], (val)=>{
this.headers["Authorization"] = val;
});
inp.el.placeholder = "Bearer OPENAI_API_KEY";
elDiv.appendChild(inp.div);
p = document.createElement("p");
p.innerText = `iRecentUserMsgCnt:${this.iRecentUserMsgCnt}`;
elDiv.appendChild(p);
let bb = ui.el_creatediv_boolbutton("SetStream", "Stream", {true: "[+] yes stream", false: "[-] do oneshot"}, this.bStream, (val)=>{
this.bStream = val;
});
elDiv.appendChild(bb.div);
p = document.createElement("p");
p.innerText = `chatRequestOptions:${JSON.stringify(this.chatRequestOptions)}`;
elDiv.appendChild(p);
bb = ui.el_creatediv_boolbutton("SetCompletionFreshChatAlways", "CompletionFreshChatAlways", {true: "[+] yes fresh", false: "[-] no, with history"}, this.bCompletionFreshChatAlways, (val)=>{
this.bCompletionFreshChatAlways = val;
});
elDiv.appendChild(bb.div);
bb = ui.el_creatediv_boolbutton("SetCompletionInsertStandardRolePrefix", "CompletionInsertStandardRolePrefix", {true: "[+] yes insert", false: "[-] dont insert"}, this.bCompletionInsertStandardRolePrefix, (val)=>{
this.bCompletionInsertStandardRolePrefix = val;
});
elDiv.appendChild(bb.div);
bb = ui.el_creatediv_boolbutton("SetTrimGarbage", "TrimGarbage", {true: "[+] yes trim", false: "[-] dont trim"}, this.bTrimGarbage, (val)=>{
this.bTrimGarbage = val;
});
elDiv.appendChild(bb.div);
let sel = ui.el_creatediv_select("SetChatHistoryInCtxt", "ChatHistoryInCtxt", this.sRecentUserMsgCnt, this.iRecentUserMsgCnt, (val)=>{
this.iRecentUserMsgCnt = this.sRecentUserMsgCnt[val];
});
elDiv.appendChild(sel.div);
sel = ui.el_creatediv_select("SetApiEP", "ApiEndPoint", ApiEP.Type, this.apiEP, (val)=>{
this.apiEP = ApiEP.Type[val];
});
elDiv.appendChild(sel.div);
this.show_settings_chatrequestoptions(elDiv);
}
@@ -619,6 +908,9 @@ let gMe;
function startme() {
console.log("INFO:SimpleChat:StartMe:Starting...");
gMe = new Me();
gMe.debug_disable();
document["gMe"] = gMe;
document["du"] = du;
for (let cid of gMe.defaultChatIds) {
gMe.multiChat.new_chat_session(cid);
}

View File

@@ -0,0 +1,211 @@
//@ts-check
// Helpers to work with html elements
// by Humans for All
//
/**
* Set the class of the children, based on whether it is the idSelected or not.
* @param {HTMLDivElement} elBase
* @param {string} idSelected
* @param {string} classSelected
* @param {string} classUnSelected
*/
export function el_children_config_class(elBase, idSelected, classSelected, classUnSelected="") {
for(let child of elBase.children) {
if (child.id == idSelected) {
child.className = classSelected;
} else {
child.className = classUnSelected;
}
}
}
/**
* Create button and set it up.
* @param {string} id
* @param {(this: HTMLButtonElement, ev: MouseEvent) => any} callback
* @param {string | undefined} name
* @param {string | undefined} innerText
*/
export function el_create_button(id, callback, name=undefined, innerText=undefined) {
if (!name) {
name = id;
}
if (!innerText) {
innerText = id;
}
let btn = document.createElement("button");
btn.id = id;
btn.name = name;
btn.innerText = innerText;
btn.addEventListener("click", callback);
return btn;
}
/**
* Create a para and set it up. Optionaly append it to a passed parent.
* @param {string} text
* @param {HTMLElement | undefined} elParent
* @param {string | undefined} id
*/
export function el_create_append_p(text, elParent=undefined, id=undefined) {
let para = document.createElement("p");
para.innerText = text;
if (id) {
para.id = id;
}
if (elParent) {
elParent.appendChild(para);
}
return para;
}
/**
* Create a button which represents bool value using specified text wrt true and false.
* When ever user clicks the button, it will toggle the value and update the shown text.
*
* @param {string} id
* @param {{true: string, false: string}} texts
* @param {boolean} defaultValue
* @param {function(boolean):void} cb
*/
export function el_create_boolbutton(id, texts, defaultValue, cb) {
let el = document.createElement("button");
el["xbool"] = defaultValue;
el["xtexts"] = structuredClone(texts);
el.innerText = el["xtexts"][String(defaultValue)];
if (id) {
el.id = id;
}
el.addEventListener('click', (ev)=>{
el["xbool"] = !el["xbool"];
el.innerText = el["xtexts"][String(el["xbool"])];
cb(el["xbool"]);
})
return el;
}
/**
* Create a div wrapped button which represents bool value using specified text wrt true and false.
* @param {string} id
* @param {string} label
* @param {{ true: string; false: string; }} texts
* @param {boolean} defaultValue
* @param {(arg0: boolean) => void} cb
* @param {string} className
*/
export function el_creatediv_boolbutton(id, label, texts, defaultValue, cb, className="gridx2") {
let div = document.createElement("div");
div.className = className;
let lbl = document.createElement("label");
lbl.setAttribute("for", id);
lbl.innerText = label;
div.appendChild(lbl);
let btn = el_create_boolbutton(id, texts, defaultValue, cb);
div.appendChild(btn);
return { div: div, el: btn };
}
/**
* Create a select ui element, with a set of options to select from.
* * options: an object which contains name-value pairs
* * defaultOption: the value whose name should be choosen, by default.
* * cb : the call back returns the name string of the option selected.
*
* @param {string} id
* @param {Object<string,*>} options
* @param {*} defaultOption
* @param {function(string):void} cb
*/
export function el_create_select(id, options, defaultOption, cb) {
let el = document.createElement("select");
el["xselected"] = defaultOption;
el["xoptions"] = structuredClone(options);
for(let cur of Object.keys(options)) {
let op = document.createElement("option");
op.value = cur;
op.innerText = cur;
if (options[cur] == defaultOption) {
op.selected = true;
}
el.appendChild(op);
}
if (id) {
el.id = id;
el.name = id;
}
el.addEventListener('change', (ev)=>{
let target = /** @type{HTMLSelectElement} */(ev.target);
console.log("DBUG:UI:Select:", id, ":", target.value);
cb(target.value);
})
return el;
}
/**
* Create a div wrapped select ui element, with a set of options to select from.
*
* @param {string} id
* @param {any} label
* @param {{ [x: string]: any; }} options
* @param {any} defaultOption
* @param {(arg0: string) => void} cb
* @param {string} className
*/
export function el_creatediv_select(id, label, options, defaultOption, cb, className="gridx2") {
let div = document.createElement("div");
div.className = className;
let lbl = document.createElement("label");
lbl.setAttribute("for", id);
lbl.innerText = label;
div.appendChild(lbl);
let sel = el_create_select(id, options,defaultOption, cb);
div.appendChild(sel);
return { div: div, el: sel };
}
/**
* Create a input ui element.
*
* @param {string} id
* @param {string} type
* @param {any} defaultValue
* @param {function(any):void} cb
*/
export function el_create_input(id, type, defaultValue, cb) {
let el = document.createElement("input");
el.type = type;
el.value = defaultValue;
if (id) {
el.id = id;
}
el.addEventListener('change', (ev)=>{
cb(el.value);
})
return el;
}
/**
* Create a div wrapped input.
*
* @param {string} id
* @param {string} label
* @param {string} type
* @param {any} defaultValue
* @param {function(any):void} cb
* @param {string} className
*/
export function el_creatediv_input(id, label, type, defaultValue, cb, className="gridx2") {
let div = document.createElement("div");
div.className = className;
let lbl = document.createElement("label");
lbl.setAttribute("for", id);
lbl.innerText = label;
div.appendChild(lbl);
let el = el_create_input(id, type, defaultValue, cb);
div.appendChild(el);
return { div: div, el: el };
}

File diff suppressed because it is too large Load Diff

View File

@@ -116,13 +116,6 @@ static inline void server_log(const char * level, const char * function, int lin
// chat template utils
//
// Check if the template supplied via "--chat-template" is supported or not. Returns true if it's valid
inline bool verify_custom_template(const std::string & tmpl) {
llama_chat_message chat[] = {{"user", "test"}};
int res = llama_chat_apply_template(nullptr, tmpl.c_str(), chat, 1, true, nullptr, 0);
return res >= 0;
}
// Format given chat. If tmpl is empty, we take the template from model metadata
inline std::string format_chat(const struct llama_model * model, const std::string & tmpl, const std::vector<json> & messages) {
size_t alloc_size = 0;
@@ -260,6 +253,13 @@ static size_t common_part(const std::vector<llama_token> & a, const std::vector<
return i;
}
static size_t common_part(const std::string & a, const std::string & b) {
size_t i;
for (i = 0; i < a.size() && i < b.size() && a[i] == b[i]; i++) {}
return i;
}
static bool ends_with(const std::string & str, const std::string & suffix) {
return str.size() >= suffix.size() && 0 == str.compare(str.size() - suffix.size(), suffix.size(), suffix);
}

View File

@@ -3,7 +3,7 @@
The purpose of this example is to demonstrate a minimal usage of llama.cpp for generating text with a given prompt.
```bash
./simple ./models/llama-7b-v2/ggml-model-f16.gguf "Hello my name is"
./simple -m ./models/llama-7b-v2/ggml-model-f16.gguf -p "Hello my name is"
...

View File

@@ -6,28 +6,27 @@
#include <string>
#include <vector>
static void print_usage(int argc, char ** argv, const gpt_params & params) {
gpt_params_print_usage(argc, argv, params);
LOG_TEE("\nexample usage:\n");
LOG_TEE("\n %s -m model.gguf -p \"Hello my name is\" -n 32\n", argv[0]);
LOG_TEE("\n");
}
int main(int argc, char ** argv) {
gpt_params params;
if (argc == 1 || argv[1][0] == '-') {
printf("usage: %s MODEL_PATH [PROMPT]\n" , argv[0]);
return 1 ;
}
params.prompt = "Hello my name is";
params.n_predict = 32;
if (argc >= 2) {
params.model = argv[1];
}
if (argc >= 3) {
params.prompt = argv[2];
}
if (params.prompt.empty()) {
params.prompt = "Hello my name is";
if (!gpt_params_parse(argc, argv, params)) {
print_usage(argc, argv, params);
return 1;
}
// total length of the sequence including the prompt
const int n_len = 32;
const int n_predict = params.n_predict;
// init LLM
@@ -36,9 +35,7 @@ int main(int argc, char ** argv) {
// initialize the model
llama_model_params model_params = llama_model_default_params();
// model_params.n_gpu_layers = 99; // offload all layers to the GPU
llama_model_params model_params = llama_model_params_from_gpt_params(params);
llama_model * model = llama_load_model_from_file(params.model.c_str(), model_params);
@@ -49,12 +46,7 @@ int main(int argc, char ** argv) {
// initialize the context
llama_context_params ctx_params = llama_context_default_params();
ctx_params.seed = 1234;
ctx_params.n_ctx = 2048;
ctx_params.n_threads = params.n_threads;
ctx_params.n_threads_batch = params.n_threads_batch == -1 ? params.n_threads : params.n_threads_batch;
llama_context_params ctx_params = llama_context_params_from_gpt_params(params);
llama_context * ctx = llama_new_context_with_model(model, ctx_params);
@@ -69,14 +61,14 @@ int main(int argc, char ** argv) {
tokens_list = ::llama_tokenize(ctx, params.prompt, true);
const int n_ctx = llama_n_ctx(ctx);
const int n_kv_req = tokens_list.size() + (n_len - tokens_list.size());
const int n_kv_req = tokens_list.size() + (n_predict - tokens_list.size());
LOG_TEE("\n%s: n_len = %d, n_ctx = %d, n_kv_req = %d\n", __func__, n_len, n_ctx, n_kv_req);
LOG_TEE("\n%s: n_predict = %d, n_ctx = %d, n_kv_req = %d\n", __func__, n_predict, n_ctx, n_kv_req);
// make sure the KV cache is big enough to hold all the prompt and generated tokens
if (n_kv_req > n_ctx) {
LOG_TEE("%s: error: n_kv_req > n_ctx, the required KV cache size is not big enough\n", __func__);
LOG_TEE("%s: either reduce n_len or increase n_ctx\n", __func__);
LOG_TEE("%s: either reduce n_predict or increase n_ctx\n", __func__);
return 1;
}
@@ -115,7 +107,7 @@ int main(int argc, char ** argv) {
const auto t_main_start = ggml_time_us();
while (n_cur <= n_len) {
while (n_cur <= n_predict) {
// sample the next token
{
auto n_vocab = llama_n_vocab(model);
@@ -134,7 +126,7 @@ int main(int argc, char ** argv) {
const llama_token new_token_id = llama_sample_token_greedy(ctx, &candidates_p);
// is it an end of generation?
if (llama_token_is_eog(model, new_token_id) || n_cur == n_len) {
if (llama_token_is_eog(model, new_token_id) || n_cur == n_predict) {
LOG_TEE("\n");
break;

View File

@@ -27,7 +27,8 @@ struct seq_draft {
int main(int argc, char ** argv) {
gpt_params params;
if (gpt_params_parse(argc, argv, params) == false) {
if (!gpt_params_parse(argc, argv, params)) {
gpt_params_print_usage(argc, argv, params);
return 1;
}

View File

@@ -302,7 +302,7 @@ static struct ggml_tensor * llama_build_train_graphs(
const int rope_mode = 0;
return ggml_rope_ext(
ctx, t, KQ_pos, nullptr, n_rot, rope_mode, n_ctx, 0, rope_freq_base, rope_freq_scale, 0.0f, 1.0f, 0.0f, 0.0f
ctx, t, KQ_pos, nullptr, n_rot, rope_mode, n_ctx, rope_freq_base, rope_freq_scale, 0.0f, 1.0f, 0.0f, 0.0f
);
};

20
flake.lock generated
View File

@@ -5,11 +5,11 @@
"nixpkgs-lib": "nixpkgs-lib"
},
"locked": {
"lastModified": 1715865404,
"narHash": "sha256-/GJvTdTpuDjNn84j82cU6bXztE0MSkdnTWClUCRub78=",
"lastModified": 1717285511,
"narHash": "sha256-iKzJcpdXih14qYVcZ9QC9XuZYnPc6T8YImb6dX166kw=",
"owner": "hercules-ci",
"repo": "flake-parts",
"rev": "8dc45382d5206bd292f9c2768b8058a8fd8311d9",
"rev": "2a55567fcf15b1b1c7ed712a2c6fadaec7412ea8",
"type": "github"
},
"original": {
@@ -20,11 +20,11 @@
},
"nixpkgs": {
"locked": {
"lastModified": 1716509168,
"narHash": "sha256-4zSIhSRRIoEBwjbPm3YiGtbd8HDWzFxJjw5DYSDy1n8=",
"lastModified": 1717786204,
"narHash": "sha256-4q0s6m0GUcN7q+Y2DqD27iLvbcd1G50T2lv08kKxkSI=",
"owner": "NixOS",
"repo": "nixpkgs",
"rev": "bfb7a882678e518398ce9a31a881538679f6f092",
"rev": "051f920625ab5aabe37c920346e3e69d7d34400e",
"type": "github"
},
"original": {
@@ -36,14 +36,14 @@
},
"nixpkgs-lib": {
"locked": {
"lastModified": 1714640452,
"narHash": "sha256-QBx10+k6JWz6u7VsohfSw8g8hjdBZEf8CFzXH1/1Z94=",
"lastModified": 1717284937,
"narHash": "sha256-lIbdfCsf8LMFloheeE6N31+BMIeixqyQWbSr2vk79EQ=",
"type": "tarball",
"url": "https://github.com/NixOS/nixpkgs/archive/50eb7ecf4cd0a5756d7275c8ba36790e5bd53e33.tar.gz"
"url": "https://github.com/NixOS/nixpkgs/archive/eb9ceca17df2ea50a250b6b27f7bf6ab0186f198.tar.gz"
},
"original": {
"type": "tarball",
"url": "https://github.com/NixOS/nixpkgs/archive/50eb7ecf4cd0a5756d7275c8ba36790e5bd53e33.tar.gz"
"url": "https://github.com/NixOS/nixpkgs/archive/eb9ceca17df2ea50a250b6b27f7bf6ab0186f198.tar.gz"
}
},
"root": {

Some files were not shown because too many files have changed in this diff Show More