* Vulkan loader code
* Fix matmul kernel, continue implementation
* Continue implementation
* Vulkan memory management
* Vulkan development
* Matmul call
* Add aligned malloc and free for VMA
* Continue implementation
* First matmul success
* GEMM Kernel optimization
* 1D Blocktiling
* 2D Blocktiling
* Write coalescing
* Continue vulkan implementation and optimization
* First FP16 attempt, disabled for now
* Code abstraction, FP16 implementation, fix kernel, add FP16 to FP32 kernel
* Enable device extensions properly, restore fp16 matmul op
* Fix mulmat_f16
* Output FP32 in fp16 matmul shader
* Fix f16_to_f32 kernel
* dequant_q4_0 kernel
* Add VMA library
* Avoid requesting dedicated memory, VMA can decide that by itself
* Add bounds checking to matmul kernels, improve implementation, fix command buffers not freed properly
* add cmake commands
* Add 2d write operation, profiling code
* Fix 2d write
* Fix queue selection for AMD RADV
* Fix trailing whitespace in vk_mem_alloc.h
* Add WIP warp tile mat mul shaders
* Disable glslc optimization
* Disable glslc optimization for CMake
* Optimize warptile matmul shader, replace blocktile with it
* Add split-k optimization for small matrix multiplication
Use semaphores for synchronization instead of fences or waitidle
Rework async write/read for synchronization
* Fix validation errors, improve compatibility with AMD GPUs
* Rework command buffer handling
* Variable matmul kernel using specialization constants
* Fix synchronization on AMD, add barriers for buffer ownership transfer, add debug flag and prints
* Reuse semaphores
* Handle stage flags during command buffer submission properly
* Increase matmul test runs for consistent results
* Fix F32 matmul
* Add vectorized loading and zeropadding for matrix multiplication
* Use pinned memory for f16 preprocessing
* Don't force aligned matmul
* Don't free before queue done
* Replace VMA library with native Vulkan buffer management
* Basic offloading support with mul_f32 and dmmv for q4_0
* Run glslc commands in parallel
* Unroll loops in dmmv shader
* Reduce usage of waitIdle
* Reuse pinned allocation for f16 conversion
* Handle devices with only a single queue
* Fix trailing whitespace in CMakeLists.txt
* Allow parallel execution of kernels, parallelize third and fourth dimension calls
* Add fallback for devices only supporting one DescriptorSet per DescriptorPool
* Move to graph function similar to CUDA implementation
* Use F16 kernel for most things, replace q_f32 with mul_mat_q_f16 function
* Add F32 dmmv shaders
* Batch submissions
* Add .spv to gitignore
* Split off matrix vector multiplication for separate optimization
* Use single command buffer for matrix vector multiplication ops
* Reduce overhead of mul_f32 calls by using a single command buffer
* Add submission batching to mul_f32
* Fix tests
* Add missing barrier
* Add further missing barrier
* Add further ops
* Replace vk::QueueFamilyIgnored with VK_QUEUE_FAMILY_IGNORED to support more Vulkan header versions
* Remove unnecessary cblas link
* Fix descriptor set pre-allocation assert
* Add runtime shader compilation, start transferring shaders to this approach
* Transfer remaining shaders to header and compile on runtime
* Fix fp32 fallback if device doesn't support fp16, add force disable env var GGML_VULKAN_DISABLE_F16
* Add support for q4_1, q5_0, q5_1 and q8_0
* Remove unnecessary scalar layout extension
* Parse graph early to pre-record command buffers
* Add q6_k support
* Add multi-submit for command buffers
* Fix q6_k dequant shader for AMD
* Fix q6_k for GPUs without fp16 support
* Simplify q6_k fp16 fix
* Minor fixes
* Fix wg_denom of m-mulmat shaders
* Add Python-based Vulkan shader generator
* Replace shaderc dependency with precompiled shaders
Fix python script to generate shaders
* Clean up code
* Fix shader generator script Windows compatibility
Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com>
* Close file before deletion
* Fix vulkan shader fp32 name
* Add q2_k and q3_k support
Add validation check to compare shader results to cpu results
* Add q4_k support
* Add q5_k support
* Bake SPIR-V bytecode into the library instead of loading shaders from file
* Switch to signal semaphores for flexibility
Prepare broadcasting support for mul mat
* Finish broadcasting mul mat support for GQA
* Clean up unused functions
Add repeat op
* Add further ops, not yet enabled. Improve semaphore code
* Reduce number of used semaphores by utilizing timelines more properly
* Remove queue information
* Reuse timeline semaphores, allow parallel operation with binary semaphores to work around nvidia driver limitations
* Add Vulkan to llama-bench
* Remove cblas dependency
* Fix matmul k-split bug
* Fix q4_k dmmv K_QUANTS_PER_ITERATION 1 shader
* Add RMS Norm shader, rework op_f32 shader setup, fix matmul bug
* Fix issues with float16 overflows in shaders
* Fix issues with older Vulkan headers on Ubuntu 22.04
* Allow multi-op partial offloading by parsing the graph to preallocate enough between-op buffers
* Implement further ops, rework op_f32 calls, fix bugs
* Finish full offloading support, add last remaining ops, fix bugs, remove redundant code
* Upload generated file ggml-vulkan-shaders.hpp, remove redundant shaders
* Merge upstream changes, fix conflicts, adapt soft_max op
* Fix Python and shader header format
* Free model gpu buffers on exit
* Use single queue per device to simplify code
* Add matmul shader support for running multiple calculations in parallel
* Switch from semaphore-synchronized multiple command buffers per op to single command buffer for multiple ops, whole graph if possible
* Fix missing event cast
* Replace uint64_t(-1) with UINT64_MAX, rename function for clarity
* Fix warning about empty C function parameters
* Fix compiler warnings
* Properly implement Vulkan backend buffer handling
* Fix oversized host staging buffers
* Simplify barrier synchronization calls
* Fix gcc warnings
* Implement max_size for backend buffer types to limit the size of a single allocation
* Use min of maxMemoryAllocationSize and maxBufferSize for device max allocation size
* refactor multi buf
* Disable unsupported ops to fix tests
* Check for maintenance4 support before using it
* Handle devices with only a single queue
* Fix single queue logic
* propagate buffer usage in multi buffers
* Implement rope_neox op
* Cleanup header and other files
* Simplify gpu_extras by removing events and putting staging memcpys into contexts
* Move queue into context
Add not-yet-enabled async backend ops
* Simplify context use, optimize matmul shader for warp size 64 (AMD GCN), fix split_k matmul shader optimization
* Add get_max_size to SYCL backend.
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* llama : fix trailing whitespace
---------
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com>
Co-authored-by: slaren <slarengh@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* first update for migration
* update init_cublas
* add debug functio, commit all help code
* step 1
* step 2
* step3 add fp16, slower 31->28
* add GGML_LIST_DEVICE function
* step 5 format device and print
* step6, enhance error check, remove CUDA macro, enhance device id to fix none-zero id issue
* support main device is non-zero
* step7 add debug for code path, rm log
* step 8, rename all macro & func from cuda by sycl
* fix error of select non-zero device, format device list
* ren ggml-sycl.hpp -> ggml-sycl.h
* clear CMAKE to rm unused lib and options
* correct queue: rm dtct:get_queue
* add print tensor function to debug
* fix error: wrong result in 658746bb26702e50f2c59c0e4ada8e9da6010481
* summary dpct definition in one header file to replace folder:dpct
* refactor device log
* mv dpct definition from folder dpct to ggml-sycl.h
* update readme, refactor build script
* fix build with sycl
* set nthread=1 when sycl, increase performance
* add run script, comment debug code
* add ls-sycl-device tool
* add ls-sycl-device, rm unused files
* rm rear space
* dos2unix
* Update README_sycl.md
* fix return type
* remove sycl version from include path
* restore rm code to fix hang issue
* add syc and link for sycl readme
* rm original sycl code before refactor
* fix code err
* add know issue for pvc hang issue
* enable SYCL_F16 support
* align pr4766
* check for sycl blas, better performance
* cleanup 1
* remove extra endif
* add build&run script, clean CMakefile, update guide by review comments
* rename macro to intel hardware
* editor config format
* format fixes
* format fixes
* editor format fix
* Remove unused headers
* skip build sycl tool for other code path
* replace tab by space
* fix blas matmul function
* fix mac build
* restore hip dependency
* fix conflict
* ren as review comments
* mv internal function to .cpp file
* export funciton print_sycl_devices(), mv class dpct definition to source file
* update CI/action for sycl code, fix CI error of repeat/dup
* fix action ID format issue
* rm unused strategy
* enable llama_f16 in ci
* fix conflict
* fix build break on MacOS, due to CI of MacOS depend on external ggml, instead of internal ggml
* fix ci cases for unsupported data type
* revert unrelated changed in cuda cmake
remove useless nommq
fix typo of GGML_USE_CLBLAS_SYCL
* revert hip cmake changes
* fix indent
* add prefix in func name
* revert no mmq
* rm cpu blas duplicate
* fix no_new_line
* fix src1->type==F16 bug.
* pass batch offset for F16 src1
* fix batch error
* fix wrong code
* revert sycl checking in test-sampling
* pass void as arguments of ggml_backend_sycl_print_sycl_devices
* remove extra blank line in test-sampling
* revert setting n_threads in sycl
* implement std::isinf for icpx with fast math.
* Update ci/run.sh
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update examples/sycl/run-llama2.sh
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update examples/sycl/run-llama2.sh
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update CMakeLists.txt
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update CMakeLists.txt
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update CMakeLists.txt
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update CMakeLists.txt
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* add copyright and MIT license declare
* update the cmd example
---------
Co-authored-by: jianyuzh <jianyu.zhang@intel.com>
Co-authored-by: luoyu-intel <yu.luo@intel.com>
Co-authored-by: Meng, Hengyu <hengyu.meng@intel.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* feat: add Dockerfiles for each platform that user ./server instead of ./main
* feat: update .github/workflows/docker.yml to build server-first docker containers
* doc: add information about running the server with Docker to README.md
* doc: add information about running with docker to the server README
* doc: update n-gpu-layers to show correct GPU usage
* fix(doc): update container tag from `server` to `server-cuda` for README example on running server container with CUDA
* Support for Yi-VL, templating fix for mobileVLM
* ws
* Update examples/llava/clip.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update llava-cli.cpp
* Update clip.cpp
bugfix for new conversions
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
updated the n_task calculation to use max number of
threads possible. This has improved the prompt eval
performance by around 5% for DOT kernels and by
around 10% for MMLA kernels on AWS Graviton3.
* server: add llama_server_queue struct
* server: add llama_server_response_event
* server: add comments
* server: move all mutexes away from server.cpp
* server: correct multitask response
* server: only add back deferred tasks when one slot is available
* server: fix a race condition cause by "request_completion"
* scripts : add lib.sh and lib_test.sh
* scripts : stub out new ci-run.sh script
* scripts : switch to PascalCase for functions
This looks a little odd at first, but I find it very useful as a
convention to know if a command is part of our code vs a builtin.
* scripts : add some fancy conversion from snake_case to PascalCase
* Add venv to ci/run.sh
* Revert scripts work
* scripts : add wrapper script for local use of ci/run.sh
* Simplify .gitignore for tests, clang-tidy fixes
* Label all ctest tests
* ci : ctest uses -L main
* Attempt at writing ctest_with_model
* Update test-model-load-cancel
* ci : add ctest_with_model for debug and release
ggml-ci
* Fix gg_get_model function
ggml-ci
* got stuck on CMake
* Add get_model.cpp to tests/CMakeLists.txt
ggml-ci
* Fix README.md output for ctest_with_model
ggml-ci
* workflows : use `-L main` for all ctest
ggml-ci
* Fixes
* GG_RUN_CTEST_MODELFILE => LLAMACPP_TESTMODELFILE
* Always show warning rather than failing if model file variable is not
set
* scripts : update usage text for ci-run.sh
* implemented dynamic temperature sampling from koboldcpp
* removed trailing whitespace
* removed unused temp parameter in llama_sample_entropy
* exposed exponent_val in dynamic temp sampler
* added debug check for printf statements
* use nullptr in llama_sample_softmax call during llama_sample_entropy
this avoids counting the time taken stats twice
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* return earlier if there is only 1 candiate (i.e. max_entropy == 0)
* reformat 't' case in llama_sample_queue
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
* check for one or zero candidates case in llama_sample_entropy
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
this fixes the error I encountered when trying to run the convert.py
script in a venv:
```
$ nix develop
[...]$ source .venv/bin/activate
(.venv)
[...]$ pip3 install -r requirements.txt
<... clipped ...>
[...]$ python3 ./convert.py
Traceback (most recent call last):
File "/home/mhueschen/projects-reference/llama.cpp/./convert.py", line 40, in <module>
from sentencepiece import SentencePieceProcessor
File "/home/mhueschen/projects-reference/llama.cpp/.venv/lib/python3.11/site-packages/sentencepiece/__init__.py", line 13, in <module>
from . import _sentencepiece
ImportError: libstdc++.so.6: cannot open shared object file: No such file or directory
```
however, I am not sure this is the cleanest way to address this linker
issue...
* kl-divergence: be able to save all logits to a file
* Add ability to compute KL-divergence
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* make GGML_TASK_INIT phase can be run in multithread
* multithreaded dequantize in mul_mat when using blas library
* minor fixes
* update outdated comment
* fix coding style
* simplify code
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* MobileVLM native implementation
* delete depthwise_conv_2d and permute_cpy relative code, replace the two by the existed functions, and opt ldp definition, support LLAMA_PERF option for CMake
* move android script to example/llava directory
* Fix the editor config checks
---------
Co-authored-by: Chenxiaotao03 <chenxiaotao03@meituan.com>
* llama : support StableLM 2 1.6B
* convert : fix Qwen's set_vocab wrongly naming all special tokens [PAD{id}]
* convert : refactor Qwen's set_vocab to use it for StableLM 2 too
* nix : add tiktoken to llama-python-extra
* convert : use presence of tokenizer.json to determine StableLM tokenizer loader
It's a less arbitrary heuristic than the vocab size.
This commit adds `--sample-start` and `--include-sample-start` to the
output from the main function in finetune.cpp.
The motivation for this is that even though these are set explicitly by
the user via the command line, if one forgets to set them then it is
useful to have their values printed out. Otherwise it is possible to go
through the whole training process before realizing that the values are
not what one expected.
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
* Add Q3_K_XS - intermediate size between Q2_K and Q3_K_S
* Q3_K_XS: quanize first 1/8 of ffn_down layers with Q4_K
Together with an importance matrix, this brings perplexity
for LLaMA-v2-70B below the perplexity of the former Q2_K
with a 800 MB smaller quantized model size.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* TruthfulQA: 1st attempt, does not look like it is working
The same implementation can be used for HellaSwag as well,
so I converted a HellaSwag validation dataset to the binary
format used here and tested with that. The score is only
around 50, so something is not quite right.
* TruthfulQA: works but the result is bad
I know it works because if I convert the HellaSwag validation
data to the binary format used in the truthful_qa_score() function
I get the exact same result as from the hellaswag_score() function.
But I guess, the questions are tricky and the way I have done
the combination of question + answer is very likely not the best.
The TruthfulQA validation dataset contains 817 questions, with
random chance result around 19%. With this version I get
29.1% for Mistral-7B and 55.2% for Mistral-7B-Instruct-v0.2.
The HF leader board results for these two models are
42.2% and 68.3%, respectively.
* TruthfulQA: fix random sample
* TruthfulQA: prepare tasks in parallel for large test datasets
* Rename truthful_qa to multiple_choice
* Make MSVC happy
I had forgotten that MSVC does not make constexpr's available
inside a lambda.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
PR #4818 (merged last week) reintroduced a config check for vocab_size that was addressed in PR #4258 (merged 2023-11-30).
Without the fix, llama2 models can't be converted. The error is:
`ValueError: The model's vocab size is set to -1 in params.json. Please update it manually. Maybe 32000?`
For Mistral-7B and fp16, time on my system goes down from 536 seconds
to 423 seconds for the full evaluation dataset (10042 tasks).
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* winogrande: simple implementation
It doesn't look like it is working - why?
For Mistral-7B it is barely better than
random chance (score ~60% for 1267 tasks), while I see
Mistral-7B scoring 78.4% on the HF leader board.
1-sigma statistical uncertainty for 1267 tasks is ~1.4,
so no way the difference is due to statistics.
* winogrande: somewhat better
Score for Mistrali7-B is now 68.9 on the validation set of
winogrande_debiased. Still far from the reported 78.4, but
better than what I had before.
* winogrande: improving
Mistral-7B score is now 73.56.
Still not quite 78.4 but getting there.
We are also getting a lower score on HellaSwag
compared to HF leader board, so I'm not expecting
we will get up to 78.4 anyway.
It looks like it is better to skip the choice word(s)
when evaluating the average log-likelihood. This kind of
makes sense because a more common word (in Winogrande this is
often a name) will have a higher probability without knowing
about the follow up context, and this will skew the log-likelihood
towards the more common word. We can only do this if the
choice words are not last in the sentence.
It also looks like it is better to skip the punctuation at the
end of the sentence, provided the choice words are not last.
* winogrande: add dataset instructions
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Metal memory: Small memory leak on init, dangling pointer, and unused autorelease pool in graph compute
* SPM header potential fix
* Reverting symlinks
* ggml : add IQ2 to test-backend-ops + refactoring
ggml-ci
* cuda : update supports_op for IQ2
ggml-ci
* ci : enable LLAMA_CUBLAS=1 for CUDA nodes
ggml-ci
* cuda : fix out-of-bounds-access in `mul_mat_vec_q`
ggml-ci
* tests : avoid creating RNGs for each Q tensor
ggml-ci
* tests : avoid creating RNGs for each tensor
ggml-ci
* backend : add eval callback
ggml-ci
* backend : group nodes in a single compute when user don't need them
* backend : clean-up the implementation
ggml-ci
* simple : do not perform tensor data copy if not needed
* simple : fix
* imatrix : offload to GPU support
* imatrix : fix ggml_mul_mat_id hanlding
ggml-ci
* ci : add imatrix test
ggml-ci
* ci : rearrange output
ggml-ci
* backend : add eval callback
ggml-ci
* backend : group nodes in a single compute when user don't need them
* backend : clean-up the implementation
ggml-ci
* simple : do not perform tensor data copy if not needed
* simple : fix
* simple : no need for ggml_is_contiguous + fix bool parse
* llama : fix callback placement in llama_context_params
* backend : avoid double-ask callback calls
* simple : restore examples, imatrix will serve as a demo
This commit adds the name of the training data file to the log message
printed when the training data is tokenized.
The motivation for this change is that it can be useful to show which
file is being tokenized when running the finetune example.
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
* Introduce starter project for Android
Based on examples/llama.swiftui.
* Add github workflow
* Set NDK version
* Only build arm64-v8a in CI
* Sync bench code
* Rename CI prop to skip-armeabi-v7a
* Remove unused tests
* metal: Log `recommendedMaxWorkingSetSize` on iOS 16+
* Only log on iOS and macOS, ignoring tvOS and other platforms
* Check for Xcode version before using recommendedMaxWorkingSetSize
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
This change makes it possible to build ggml-cuda.cu and ggml-metal.m as
independent dynamic shared objects, that may be conditionally linked at
runtime in a multiplatform binary. It introduces a GGML_CALL annotation
that documents which functions have a cyclic call relationship, between
the application code and GPU modules.
This change does nothing, unless the build defines -DGGML_MULTIPLATFORM
which causes back-references and function pointers to conform to MS ABI
which is supported by NVCC, ROCm, XCode, GCC and Clang across platforms
This commit replaces the magic number LLAMA_FILE_MAGIC_LORA used in
finetune.cpp with LLAMA_FILE_MAGIC_GGLA defined in llama.h.
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
* Fix ffn_down quantization mix for MoE models
In #4872 I did not consider the part where every third
tensor is quantized with more bits. Fir MoE this leads to tensors
of the same layer being quantized with different number of bits,
which is not considered as a possibility in the inference implementation
(it is assumed all experts use the same quantization).
* Fix the fix
* Review suggestion
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* examples : save-load-state: save only required state
* llama : only reserve n_vocab * n_batch at most for logits
llama_decode asserts that only n_batch tokens are passed each call, and
n_ctx is expected to be bigger than n_batch.
* llama : always reserve n_vocab * n_batch for logits
llama_context de-serialization breaks if the contexts have differing
capacity for logits and llama_decode will at maximum resize to
n_vocab * n_batch.
* llama : only save and restore used logits
for batch sizes of 512 this reduces save state in the best case by
around 62 MB, which can be a lot if planning to save on each message
to allow regenerating messages.
* llama : use ostringstream and istringstream for save and load
* llama : serialize rng into minimum amount of space required
* llama : break session version due to serialization changes
* add the parameter : --no-display-prompt , combine with --log-disable it will display only the generated tokens
* remove empty line
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* metal : detect more GPU families
* metal : refactor kernel loading
* metal : set kernel family requirements
* metal : fix kernel init + fix compile options
* metal : take into account simdgroup reduction support
* metal : print only skipped kernels
* metal : fix check for simdgroup reduction support
* metal : check for Metal 3
* metal : free allocations
* metal : normalize encoder:setComputePipelineStatus calls
ggml-ci
* metal : fix Metal3 family check
ggml-ci
* metal : check for simdgroup matrix mul. feature
ggml-ci
* llama : ggml-backend integration
* ggml-backend : add names to buffers
* fix unmap after loading
* batched-bench : add tensor_split param
* llama : check for null tensor_split
* ggml-backend : increase GGML_MAX_BACKENDS
* improve graph splitting, partial fix for --no-kv-offload
* cuda : add ggml-backend split buffer support
* cuda : do not create buffer types for devices that don't exist (fixes usage without CUDA devices available)
* ggml : fix null backend dereference (#4807)
* ggml : fix null backend dereference
* ggml : also check ggml_backend_is_cpu
* test-backend-ops : check buffer allocation failures
* llama : add cparam (split_mode) and command line argument (--split-mode, -sm) to configure the split mode (none, layer or row)
* ggml : fix mul_mat_id work size
* llama : rewrite session kv load/set without graphs
* minor
* llama : only initialize used backends, free backends on context free
* llama : abort ctx if cuda backend init fails
* llama : rewrite lora with ggml-backend and compute on CPU
ggml-ci
* llama : only map to a backend buffer the region of the file mapping containing the tensors used in the buffer
* opencl : add ggml-backend buffer type
* cuda : only use batched_cublas with batched mat muls (fixes fp16 tg perf)
* llama : on Metal, by default offload the full model
ggml-ci
* metal : page align the data ptr (#4854)
* Apply suggestions from code review
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* cuda : fix split buffer free
* address review comments
* llama-bench : add split-mode parameter
* fix whitespace
* opencl : fix double initialization
* server : add --split-mode parameter
* use async copy and compute to improve multi-gpu performance
ggml-ci
* use async memcpys to copy the graph outputs to the CPU
* fix opencl
* use a host buffer for the cpu compute buffer for faster copies to the gpu
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
This commit replaces the magic number used in export-lora.cpp with
the one defined in llama.h, which is indirectly included via common.h.
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
* Updated Models Layout
- Added a models drawer
- Added downloading directly from Hugging Face
- Load custom models from local folder
- Delete models by swiping left
* trimmed trailing white space
* Updated Models Layout
* common : streamline the formatting of help
- Separate alternative parameters by a comma
- Do not indent `--version` differently
* Update common/common.cpp
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Restore intended k-quants quantization mixes for MoE models
* Update Q2_K_S values in the quantize tool
Still using LLaMA-v1 PPL values in the quant description
today does not make much sense. But let's leave this update
for another PR.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* iq2_xs: basics
* iq2_xs: this should have been in the basics
* iq2_xs: CUDA and scalar CPU works
* iq2_xs: WIP Metal
* iq2_xs: Metal now works
* iq2_xs: working, but dog slow, ARM_NEON dot product
* iq2_xs: better ARM_NEON dot product
We are now at 19.5 t/s for TG-128 and 61 t/s for PP-512 when
running on the CPU.
* iq2_xs: AVX2 dot product - 19.5 t/s
* iq2_xs: faster AVX2 dit product
21.4 t/s for TG-128, 59.2 t/s for PP-512.
The latter is 2x compared to the previous version.
* iq2_xs: had forgotten to delete iq2-data.h
* Add llama enum for IQ2_XS
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* server: added support for multiple api keys, added loading api keys from file
* minor: fix whitespace
* added file error handling to --api-key-file, changed code to better
reflect current style
* server: update README.md for --api-key-file
---------
Co-authored-by: Michael Coppola <info@michaeljcoppola.com>
* added /health endpoint to the server
* added comments on the additional /health endpoint
* Better handling of server state
When the model is being loaded, the server state is `LOADING_MODEL`. If model-loading fails, the server state becomes `ERROR`, otherwise it becomes `READY`. The `/health` endpoint provides more granular messages now according to the server_state value.
* initialized server_state
* fixed a typo
* starting http server before initializing the model
* Update server.cpp
* Update server.cpp
* fixes
* fixes
* fixes
* made ServerState atomic and turned two-line spaces into one-line
* updated `server` readme to document the `/health` endpoint too
* used LOG_INFO after successful model loading
NULL can be an integer constant expression with the value zero, in this case the behavior would be undefined because of an incorrect type being passed to the variable arguments.
* added /health endpoint to the server
* added comments on the additional /health endpoint
* Better handling of server state
When the model is being loaded, the server state is `LOADING_MODEL`. If model-loading fails, the server state becomes `ERROR`, otherwise it becomes `READY`. The `/health` endpoint provides more granular messages now according to the server_state value.
* initialized server_state
* fixed a typo
* starting http server before initializing the model
* Update server.cpp
* Update server.cpp
* fixes
* fixes
* fixes
* made ServerState atomic and turned two-line spaces into one-line
* updated `server` readme to document the `/health` endpoint too
* added /health endpoint to the server
* added comments on the additional /health endpoint
* Better handling of server state
When the model is being loaded, the server state is `LOADING_MODEL`. If model-loading fails, the server state becomes `ERROR`, otherwise it becomes `READY`. The `/health` endpoint provides more granular messages now according to the server_state value.
* initialized server_state
* fixed a typo
* starting http server before initializing the model
* Update server.cpp
* Update server.cpp
* fixes
* fixes
* fixes
* made ServerState atomic and turned two-line spaces into one-line
This update categorizes models with 24 layers as MODEL_1B, ensuring compatibility with different Phi model variants without impacting existing Phi-2 model functionality.
Uses ggml functions instead of hardcoded names and adds support to quantize into the modern Q-K variants.
This is just the bare minimum to get k-types working - a more refined choice of types would be needed to get best quality on low quantizations.
I ran a few tests, it doesn't break anything I could notice and a Q6_K ViT works almost as well as Q8_0 but 3 times the inference speed.
* Update Imports and Add Notes for Future Reference
- Updated import statements in `convert.py`.
- Added import for `AutoTokenizer` from `transformers` module.
- Added conditional import for `gguf` from the local directory.
- Added comments and notes for future reference.
Additional Notes:
- Noted removal of a redundant `TypeAlias` import.
- Noted the removal of a `gguf` debug statement.
- Commented on the presence of `ARCH` and `NDArray` definitions.
- Commented on cleaning up and refactoring data type definitions.
* Refine Model Hyperparameters and Params Class
- Updated type annotations to use `Optional` for clarity.
- Improved method names and attribute consistency.
- Removed unnecessary variables for better code readability.
Additional Notes:
- Highlighted the use of `Optional` for clearer intent.
- Ensured backward and forward compatibility.
* Restore BpeVocab and SentencePieceVocab classes
- Restored the BpeVocab class for handling BPE tokenization.
- Restored the SentencePieceVocab class for SentencePiece tokenization.
These classes are essential for maintaining the original behavior of the codebase.
* refactor: Standardize vocabulary handling with HfVocab
- Replaced VocabLoader with HfVocab, aligning vocabulary handling across classes.
- Updated initialization of HfVocab with local_files_only=True for AutoTokenizer.
- Introduced optional parameter fname_added_tokens for flexible added token management.
- Streamlined added token handling for clarity and conciseness.
- Maintained special tokens and IDs, enhancing token management.
- Simplified token processing methods for improved readability.
- Added a placeholder for score computation with a default value of -1000.0.
- Optimized newline token check for efficiency.
- Updated __repr__ function for clarity in representation.
- Adjusted type alias Vocab to include BpeVocab, SentencePieceVocab, and HfVocab.
- Removed redundant code related to special token handling, reverse vocabulary mapping, and vocabulary file detection.
This refactoring promotes a standardized and modular approach to vocabulary management, facilitating future integration with a VocabFactory and improving code maintainability and scalability.
* refactor: Enhance readability, functionality, and code quality
- Improved code formatting and readability for better maintainability.
- Refactored LazyUnpickler's CLASSES dictionary for clarity.
- Added print statements and warnings in check_vocab_size for user feedback.
- Removed find_vocab_file_path, as it's superseded by VocabFactory.
- Preparatory changes for upcoming classes: OutputFile and VocabFactory.
- Overall focus on code quality, error handling, and consistency.
These changes reflect a continuous effort to refine the codebase, ensuring it meets best practices and prepares for future enhancements, such as the VocabFactory.
* refactor: Update OutputFile class for enhanced model vocabulary management
- Restructured the constructor for improved readability.
- Updated `add_meta_arch` method for flexible model name determination.
- Introduced `handle_tokenizer_model` for mapping vocab types to supported tokenizer models.
- Streamlined vocabulary extraction with `extract_vocabulary_from_model`.
- Simplified vocabulary metadata addition using `add_meta_vocab`.
- Refactored `add_tensor_info` for clarity and consistency.
- Improved error handling for better user feedback.
These changes signify the development of a versatile and comprehensive `OutputFile` class, enabling efficient management of model conversion output, metadata, vocabulary, and tensor information.
* feat: Introduce VocabFactory for flexible vocabulary management in model conversion
- The VocabFactory class is added to facilitate modular vocabulary handling.
- The constructor initializes a directory path and detects vocabulary-related files.
- The _select_file method provides file paths based on vocabulary type (e.g., BPE, SentencePiece).
- _create_special_vocab generates special vocabularies, accommodating different types.
- The load_vocab method loads vocabularies, handling BPE, SentencePiece, and Hugging Face Fast Tokenizer.
- Error handling and logging enhance debugging and user feedback.
- The modular and flexible design simplifies vocabulary management and supports future extensions.
The VocabFactory class enhances code modularity and maintainability, allowing versatile vocabulary handling in the model conversion process.
* refactor: Improve code organization, argument parsing, and user interface
- Renamed 'default_outfile' to 'default_output_file' for clarity.
- Refactored argument parser setup into 'get_argument_parser' function.
- Introduced descriptive comments for each argument in the parser.
- Added '--vocab-type' argument with choices ["spm", "bpe", "hfft"] for vocabulary processing.
- Improved flag naming consistency: '--outfile' to '--out-file' and '--bigendian' to '--big-endian'.
- Enhanced error handling to prevent overwriting input data in 'default_output_file'.
- Made 'argv' in 'main' an optional parameter for flexibility.
- Introduced dynamic import for 'awq.apply_awq' based on 'args.awq_path' for conditional dependency.
These changes enhance code clarity, organization, and the user interface of the script, aligning it with Python best practices and improving maintainability.
* refactor: Further refine functionality, improve user interaction, and streamline vocabulary handling
- Renamed command-line arguments for clarity and consistency.
- Improved path resolution and import adjustments for robustness.
- Thoughtfully handled 'awq-path' and conditional logic for the weighted model.
- Enhanced model and vocabulary loading with the 'VocabFactory' class for structured and adaptable loading.
- Strengthened error handling and user feedback for a more user-friendly experience.
- Structured output file handling with clear conditions and defaults.
- Streamlined and organized the 'main' function for better logic flow.
- Passed 'sys.argv[1:]' to 'main' for adaptability and testability.
These changes solidify the script's functionality, making it more robust, user-friendly, and adaptable. The use of the 'VocabFactory' class is a notable enhancement in efficient vocabulary handling, reflecting a thoughtful and iterative approach to script development.
* chore: Apply ruff formatting to convert.py
Signed-off-by: teleprint-me <77757836+teleprint-me@users.noreply.github.com>
* Revert to commit 0614c33
* chore: Apply flake8 formatting rules
Signed-off-by: teleprint-me <77757836+teleprint-me@users.noreply.github.com>
* refactor: Revise `check_vocab_size` for Enhanced Clarity and Correctness
- Resolved an unreachable branch issue by reorganizing the conditional structure.
- Moved the special case check for `params.n_vocab == -1` to the top for immediate assertion.
- Flattened the conditional logic for improved clarity and predictability of the function's behavior.
These changes enhance the readability and functional correctness of the `check_vocab_size` function without altering its intended functionality.
* py : fix outfile and outtype
* py : suggest hint for missing vocab size
---------
Signed-off-by: teleprint-me <77757836+teleprint-me@users.noreply.github.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
This change fixes an issue where supplying `--image missing-file` would
result in a segfault due to a null pointer being dereferenced. This can
result in distracting info being printed if robust crash analysis tools
are being used.
* updated server readme to reflect the gg/server-token-probs-4088 commit
added explanation for the API's completion result which now includes `completion_probabilities`. Also added a JSON schema that shows the type/structure of `completion_probabilities`.
* simplified the `completion_probabilities` JSON schema
It's now easier to understand what the structure of `completion_probabilities` looks like.
* minor : fix trailing whitespace
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* iq2_xxs: basics
* iq2_xxs: scalar and AVX2 dot products
Needed to change Q8_K to have quants in the -127...127 range,
else the IQ2_XXS AVX implementation becomes very awkward.
The alternative would have been to use Q8_0 instead. Perhaps
I'll change later, for now this is what we have.
* iq2_xxs: ARM_NEON dot product
Somehow strangely slow (112 ms/token).
* iq2_xxs: WIP Metal
Dequantize works, something is still wrong with the
dot product.
* iq2_xxs: Metal dot product now works
We have
PP-512 = 475 t/s
TG-128 = 47.3 t/s
Not the greatest performance, but not complete garbage either.
* iq2_xxs: slighty faster dot product
TG-128 is now 48.4 t/s
* iq2_xxs: slighty faster dot product
TG-128 is now 50.9 t/s
* iq2_xxs: even faster Metal dot product
TG-128 is now 54.1 t/s.
Strangely enough, putting the signs lookup table
into shared memory has a bigger impact than the
grid values being in shared memory.
* iq2_xxs: dequantize CUDA kernel - fix conflict with master
* iq2_xxs: quantized CUDA dot product (MMVQ)
We get TG-128 = 153.1 t/s
* iq2_xxs: slightly faster CUDA dot product
TG-128 is now at 155.1 t/s.
* iq2_xxs: add to llama ftype enum
* iq2_xxs: fix MoE on Metal
* Fix missing MMQ ops when on hipBLAS
I had put the ggml_supports_mmq call at the wrong place.
* Fix bug in qequantize_row_iq2_xxs
The 0.25f factor was missing.
Great detective work by @ggerganov!
* Fixing tests
* PR suggestion
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* ggml : disable fast-math for Metal (cmake build only)
ggml-ci
* metal : fix Metal API debug warnings
* cmake : add -fno-inline for Metal build (#4545)
* metal : fix API debug warnings
* metal : fix compile warnings
* metal : use uint64_t for strides
* cmake : rename option to LLAMA_METAL_SHADER_DEBUG
* metal : fix mat-vec Q8_0 kernel for BS > 1
* metal : normalize mat-vec kernel signatures
* cmake : respect LLAMA_QKK_64 option
* metal : fix mat-vec Q4_K kernel for QK_K == 64
* metal : optimizing ggml_mul_mat_id (wip)
* metal : minor fix
* metal : opt mul_mm_id
* Add n_key_dim and n_value_dim
Some models use values that are not derived from `n_embd`.
Also remove `n_embd_head` and `n_embd_gqa` because it is not clear
which "head" is referred to (key or value).
Fix issue #4648.
* Fix `llm_build_kqv` to use `n_value_gqa`
* Rebase
* Rename variables
* Fix llm_build_kqv to be more generic wrt n_embd_head_k
* Update default values for n_embd_head_k and n_embd_head_v
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Fix llm_load_tensors: the asserts were not backcompat
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Changes to server to allow metadata override
* documentation
* flake.nix: expose full scope in legacyPackages
* flake.nix: rocm not yet supported on aarch64, so hide the output
* flake.nix: expose checks
* workflows: nix-ci: init; build flake outputs
* workflows: nix-ci: add a job for eval
* workflows: weekly `nix flake update`
* workflows: nix-flakestry: drop tag filters
...and add a job for flakehub.com
* workflows: nix-ci: add a qemu job for jetsons
* flake.nix: suggest the binary caches
* flake.lock: update
to a commit recently cached by nixpkgs-cuda-ci
---------
Co-authored-by: John <john@jLap.lan>
Co-authored-by: Someone Serge <sergei.kozlukov@aalto.fi>
* ggml : disable fast-math for Metal (cmake build only)
ggml-ci
* metal : fix Metal API debug warnings
* cmake : add -fno-inline for Metal build (#4545)
* metal : fix API debug warnings
* metal : fix compile warnings
* metal : use uint64_t for strides
* cmake : rename option to LLAMA_METAL_SHADER_DEBUG
* metal : fix mat-vec Q8_0 kernel for BS > 1
* metal : normalize mat-vec kernel signatures
* cmake : respect LLAMA_QKK_64 option
* metal : fix mat-vec Q4_K kernel for QK_K == 64
ggml-ci
* feat: add avx_vnni based on intel documents
* ggml: add avx vnni based on intel document
* llama: add avx vnni information display
* docs: add more details about using oneMKL and oneAPI for intel processors
* docs: add more details about using oneMKL and oneAPI for intel processors
* docs: add more details about using oneMKL and oneAPI for intel processors
* docs: add more details about using oneMKL and oneAPI for intel processors
* docs: add more details about using oneMKL and oneAPI for intel processors
* Update ggml.c
Fix indentation upgate
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* python: add check-requirements.sh and GitHub workflow
This script and workflow forces package versions to remain compatible
across all convert*.py scripts, while allowing secondary convert scripts
to import dependencies not wanted in convert.py.
* Move requirements into ./requirements
* Fail on "==" being used for package requirements (but can be suppressed)
* Enforce "compatible release" syntax instead of ==
* Update workflow
* Add upper version bound for transformers and protobuf
* improve check-requirements.sh
* small syntax change
* don't remove venvs if nocleanup is passed
* See if this fixes docker workflow
* Move check-requirements.sh into ./scripts/
---------
Co-authored-by: Jared Van Bortel <jared@nomic.ai>
* flake.lock: update to hotfix CUDA::cuda_driver
Required to support https://github.com/ggerganov/llama.cpp/pull/4606
* flake.nix: rewrite
1. Split into separate files per output.
2. Added overlays, so that this flake can be integrated into others.
The names in the overlay are `llama-cpp`, `llama-cpp-opencl`,
`llama-cpp-cuda`, and `llama-cpp-rocm` so that they fit into the
broader set of Nix packages from [nixpkgs](https://github.com/nixos/nixpkgs).
3. Use [callPackage](https://summer.nixos.org/blog/callpackage-a-tool-for-the-lazy/)
rather than `with pkgs;` so that there's dependency injection rather
than dependency lookup.
4. Add a description and meta information for each package.
The description includes a bit about what's trying to accelerate each one.
5. Use specific CUDA packages instead of cudatoolkit on the advice of SomeoneSerge.
6. Format with `serokell/nixfmt` for a consistent style.
7. Update `flake.lock` with the latest goods.
* flake.nix: use finalPackage instead of passing it manually
* nix: unclutter darwin support
* nix: pass most darwin frameworks unconditionally
...for simplicity
* *.nix: nixfmt
nix shell github:piegamesde/nixfmt/rfc101-style --command \
nixfmt flake.nix .devops/nix/*.nix
* flake.nix: add maintainers
* nix: move meta down to follow Nixpkgs style more closely
* nix: add missing meta attributes
nix: clarify the interpretation of meta.maintainers
nix: clarify the meaning of "broken" and "badPlatforms"
nix: passthru: expose the use* flags for inspection
E.g.:
```
❯ nix eval .#cuda.useCuda
true
```
* flake.nix: avoid re-evaluating nixpkgs too many times
* flake.nix: use flake-parts
* nix: migrate to pname+version
* flake.nix: overlay: expose both the namespace and the default attribute
* ci: add the (Nix) flakestry workflow
* nix: cmakeFlags: explicit OFF bools
* nix: cuda: reduce runtime closure
* nix: fewer rebuilds
* nix: respect config.cudaCapabilities
* nix: add the impure driver's location to the DT_RUNPATHs
* nix: clean sources more thoroughly
...this way outPaths change less frequently,
and so there are fewer rebuilds
* nix: explicit mpi support
* nix: explicit jetson support
* flake.nix: darwin: only expose the default
---------
Co-authored-by: Someone Serge <sergei.kozlukov@aalto.fi>
This change makes it possible to use flags like `--grammar` when using
the `llava-cli` program. The rest is just code cleanup deleting a long
standing TODO comment.
This change also ensures that logging information is emitted to stderr
which helps the `llava-cli` command be more friendly to shell scripts.
See Mozilla-Ocho/llamafile@1cd334f
The server currently schedules tasks using a sleep(5ms) busy loop. This
adds unnecessary latency since most sleep implementations do a round up
to the system scheduling quantum (usually 10ms). Other libc sleep impls
spin for smaller time intervals which results in the server's busy loop
consuming all available cpu. Having the explicit notify() / wait() code
also helps aid in the readability of the server code.
See mozilla-Ocho/llamafile@711344b
* fixed mul-mat error for old GPUs
* style fixes
* add mul mat src1 f16 test cases, fix more cases
ggml-ci
---------
Co-authored-by: bssrdf <bssrdf@gmail.com>
Co-authored-by: slaren <slarengh@gmail.com>
The default values for tfs_z and typical_p were being set to zero, which
caused the token candidates array to get shrunk down to one element thus
preventing any sampling. Note this only applies to OpenAI API compatible
HTTP server requests.
The solution is to use the default values that OpenAI documents, as well
as ensuring we use the llama.cpp defaults for the rest. I've tested this
change still ensures deterministic output by default. If a "temperature"
greater than 0 is explicitly passed, then output is unique each time. If
"seed" is specified in addition to "temperature" then the output becomes
deterministic once more.
See mozilla-Ocho/llamafile#117
See mozilla-Ocho/llamafile@9e4bf29
* cuda : fix vmm pool with multi GPU
* hip
* use recommended granularity instead of minimum
* better error checking
* fix mixtral
* use cudaMemcpy3DPeerAsync
* use cuda_pool_alloc in ggml_cuda_op_mul_mat
* consolidate error checking in ggml_cuda_set_device
* remove unnecessary inlines
ggml-ci
* style fixes
* only use vmm for the main device
* fix scratch buffer size, re-enable vmm pool for all devices
* remove unnecessary check id != g_main_device
* cuda : improve cuda pool efficiency using virtual memory
* fix mixtral
* fix cmake build
* check for vmm support, disable for hip
ggml-ci
* fix hip build
* clarify granularity
* move all caps to g_device_caps
* refactor error checking
* add cuda_pool_alloc, refactor most pool allocations
ggml-ci
* fix hip build
* CUBLAS_TF32_TENSOR_OP_MATH is not a macro
* more hip crap
* llama : fix msvc warnings
* ggml : fix msvc warnings
* minor
* minor
* cuda : fallback to CPU on host buffer alloc fail
* Update ggml-cuda.cu
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Update ggml-cuda.cu
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* ensure allocations are always aligned
* act_size -> actual_size
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Check the full vocab for grammar only if necessary
* Fix missing logit restoration step (?)
Does this matter, actually?
* Fix whitespace / formatting
* Adjust comment
* Didn't mean to push test gbnf
* Split sampling into the helper function (?)
And also revert the changes made to the header
* common : fix final newline
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* initial commit, going through initializations
* main loop finished, starting to debug
* BUG: generates gibberish/repeating tokens after a while
* kv_cache management
* Added colors to distinguish drafted tokens (--color). Updated README
* lookup : fix token positions in the draft batch
* lookup : use n_draft from CLI params
* lookup : final touches
---------
Co-authored-by: Leon Ericsson <leon.ericsson@icloud.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* fix old jetson compile error
* Update Makefile
* update jetson detect and cuda version detect
* update cuda marco define
* update makefile and cuda,fix some issue
* Update README.md
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update Makefile
* Update README.md
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* llama : Add ability to cancel model load
Updated llama_progress_callback so that if it returns false, the model
loading is aborted.
* llama : Add test for model load cancellation
* Fix bool return in llama_model_load, remove std::ignore use
* Update llama.cpp
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
* Fail test if model file is missing
* Revert "Fail test if model file is missing"
This reverts commit 32ebd525bf.
* Add test-model-load-cancel to Makefile
* Revert "Revert "Fail test if model file is missing""
This reverts commit 2796953257.
* Simplify .gitignore for tests, clang-tidy fixes
* Label all ctest tests
* ci : ctest uses -L main
* Attempt at writing ctest_with_model
* ci : get ci/run.sh working with test-model-load-cancel
* ci : restrict .github/workflows/build.yml ctest to -L main
* update requirements.txt
* Disable test-model-load-cancel in make
* Remove venv before creation
* Restructure requirements.txt
Top-level now imports the specific additional requirements for each
python file. Using `pip install -r requirements.txt` will fail if
versions become mismatched in the per-file requirements.
* Make per-python-script requirements work alone
This doesn't break the main requirements.txt.
* Add comment
* Add convert-persimmon-to-gguf.py to new requirements.txt scheme
* Add check-requirements.sh script and GitHub workflow
* Remove shellcheck installation step from workflow
* Add nocleanup special arg
* Fix merge
see: https://github.com/ggerganov/llama.cpp/pull/4462#discussion_r1434593573
* reset to upstream/master
* Redo changes for cancelling model load
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
* llama : initial ggml-backend integration
* add ggml-metal
* cuda backend can be used though ggml-backend with LLAMA_GGML_BACKEND_CUDA_TEST
access all tensor data with ggml_backend_tensor_get/set
* add ggml_backend_buffer_clear
zero-init KV cache buffer
* add ggml_backend_buffer_is_hos, used to avoid copies if possible when accesing tensor data
* disable gpu backends with ngl 0
* more accurate mlock
* unmap offloaded part of the model
* use posix_fadvise64(.., POSIX_FADV_SEQUENTIAL) to improve performance with mmap
* update quantize and lora
* update session copy/set to use ggml-backend
ggml-ci
* use posix_fadvise instead of posix_fadvise64
* ggml_backend_alloc_ctx_tensors_from_buft : remove old print
* llama_mmap::align_offset : use pointers instead of references for out parameters
* restore progress_callback behavior
* move final progress_callback call to load_all_data
* cuda : fix fprintf format string (minor)
* do not offload scales
* llama_mmap : avoid unmapping the same fragments again in the destructor
* remove unnecessary unmap
* metal : add default log function that prints to stderr, cleanup code
ggml-ci
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* allowed getting n_batch from llama_context in c api
* changed to use `uint32_t` instead of `int`
* changed to use `uint32_t` instead of `int` in `llama_n_ctx`
* Update llama.h
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* AMD ROCm: handle UMA memory VRAM expansions
This resolves#2797 by allowing ROCm AMD GPU users with a UMA to
dynamically expand the VRAM allocated to the GPU.
Without this, AMD ROCm users with shared CPU/GPU memory usually are
stuck with the BIOS-set (or fixed) framebuffer VRAM, making it
impossible to load more than 1-2 layers.
Note that the model is duplicated in RAM because it's loaded once for
the CPU and then copied into a second set of allocations that are
managed by the HIP UMA system. We can fix this later.
* clarify build process for ROCm on linux with cmake
* avoid using deprecated ROCm hipMallocHost
* keep simplifying the change required for UMA
* cmake: enable UMA-compatible allocation when LLAMA_HIP_UMA=ON
Otherwise, on Windows converting bling-phi-2-v0 (<https://huggingface.co/llmware/bling-phi-2-v0>) via convert-hf-to-gguf.py will fail with the following error:
```
Traceback (most recent call last):
File "C:\Users\User\git\gguf\convert-hf-to-gguf.py", line 1061, in <module>
model_instance.set_vocab()
File "C:\Users\User\git\gguf\convert-hf-to-gguf.py", line 52, in set_vocab
self._set_vocab_gpt2()
File "C:\Users\User\git\gguf\convert-hf-to-gguf.py", line 264, in _set_vocab_gpt2
special_vocab = gguf.SpecialVocab(dir_model, load_merges=True)
File "C:\Users\User\git\gguf\gguf\vocab.py", line 33, in __init__
self._load(Path(path))
File "C:\Users\User\git\gguf\gguf\vocab.py", line 81, in _load
self._try_load_merges_txt(path)
File "C:\Users\User\git\gguf\gguf\vocab.py", line 95, in _try_load_merges_txt
for line in fp:
File "C:\Users\User\miniconda3\envs\gguf\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 1415: character maps to <undefined>
```
regression of #4490
Adds defines for two new datatypes
cublasComputeType_t, cudaDataType_t.
Currently using deprecated hipblasDatatype_t since newer ones very recent.
* build : Check the ROCm installation location
* more generic approach
* fixup! It was returning the path instead of the command output
* fixup! Trailing whitespace
* Add API key authentication for enhanced server-client security
* server : to snake_case
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* ggml : group mul_mat_id rows by matrix (cpu only)
* remove mmid parameters from mm forward
* store row groups in wdata and calculate only once in GGML_TASK_INIT
ggml-ci
* Fixes "Not enough space in the context's memory pool" encountered on certain models, which seems to be caused by some imprecision related to the automatic casting of floating point values
* do not cast to size_t, instead just use doubles
* ggml : add ggml_row_size(), deprecate ggml_type_sizef()
* ggml : fix row size compute to avoid overflows
* tests : fix sizey -> sizez
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* sync : ggml (SD ops, tests, kernels)
ggml-ci
* cuda : restore im2col
ggml-ci
* metal : fix accuracy of dequantization kernels
ggml-ci
* cuda : restore correct im2col
ggml-ci
* metal : try to fix moe test by reducing expert size
ggml-ci
* cuda : fix bin bcast when src1 and dst have different types
ggml-ci
---------
Co-authored-by: slaren <slarengh@gmail.com>
* convert : support Mixtral as LLAMA arch
* convert : fix n_ff typo
* llama : model loading
* ggml : sync latest ggml_mul_mat_id
* llama : update graph to support MoE
* llama : fix cur -> cur_expert
* llama : first working version
* llama : fix expert weighting in the FFN
* ggml : ggml_get_rows support 2D indexing [n_tokens, n_experts] (cpu only)
* ggml : add n_as argument to ggml_mul_mat_id
* ggml : fix ggml_get_rows to take into account ne02 / ne11
* metal : add more general support for ggml_get_rows + tests
* llama : add basic support for offloading moe with CUDA
* metal : add/mul/div use general kernel when src1 not cont
* metal : reduce the kernel launches for ggml_mul_mat_id
* ggml : get_rows : support non-contiguos tensors with gaps, generalize up to 3D
* ggml : update get_rows f16 and q
* cuda : support non-contiguous src1 in get_rows
* llama : offload missing ffn_moe_silu
* metal : fix ggml_get_rows to work with non-cont src1
* metal : add indirect mat-vec kernels for all quantization types
* llama : do not quantize expert gating tensors
* llama : add n_expert and n_expert_used to hparams + change quants
* test-backend-ops : add moe test
* cuda : fix get_rows when ncols is odd
* convert : determine n_ctx correctly
* metal : fix ggml_mul_mat_id for F32
* test-backend-ops : make experts more evenly probable (test_moe)
* test-backend-ops : cleanup, add moe test for batches
* test-backend-ops : add cpy from f32 -> all types test
* test-backend-ops : fix dequantize block offset
* llama : fix hard-coded number of experts
* test-backend-ops : simplify and disable slow tests to avoid CI timeout
* test-backend-ops : disable MOE test with thread sanitizer
* cuda : fix mul_mat_id with multi gpu
* convert : use 1e6 rope_freq_base for mixtral
* convert : fix style
* convert : support safetensors format
* gguf-py : bump version
* metal : add cpy f16 -> f32 kernel
* metal : fix binary ops for ne10 % 4 != 0
* test-backend-ops : add one more sum_rows test
* ggml : do not use BLAS with ggml_mul_mat_id
* convert-hf : support for mixtral-instruct (#4428)
* convert : typo fix, add additional hyperparameters, use LLaMA arch for Mixtral-instruct
* convert : use sentencepiece tokenizer for Mixtral-instruct
* convert : make flake8 happy
* metal : fix soft_max kernels
ref: 1914017863
* metal : limit kernels to not use more than the allowed threads
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Radek Pilar <github@mrkva.eu>
llama_context_params.logits_all is a parameter for controlling
llama_eval. This documents that logits_all should not be used with
llama_decode and llama_batch.
On commit b1108 (44c117f4) xaedes added
ggml_allocr * alloc = NULL;
... (many lines in between)
if (alloc) {
ggml_allocr_free(alloc);
}
Which is correct, but it's easy to lose context after many lines in between.
On commit b1287 (0e76a899) xaedes made a big change. From here on, alloc is freed eagerly.
alloc = ggml_allocr_new(...)
... (short lines of code)
ggml_allocr_free(alloc)
This happens a few times, but alloc is never set to NULL, and many lines below,
we still have
if (alloc) {
ggml_allocr_free(alloc);
}
which causes a double-free.
* reserve space for codepoints
* improvement for the appended 0
* used precomputed token text for grammar sample
* reserve canidates_decoded
* reserve canidates_grammar
* remove candidates_decoded
* Revert "remove candidates_decoded"
This reverts commit 3773328080.
* changed decode_utf8 to take src by ref
* feat: Allow overriding GGUF metadata when loading model
* Fix the one time GCC is stricter than clang about something
* Step1
* Refactor... basically everything!
* Nuke obsolete GetArrayLen struct
* simplify std::string specialization
* Various cleanups
Add informational output when overrides are applied
Warn user when an override with the wrong type is specified
* Fix broken logic for parsing bool KV overrides
Fix issue where overrides didn't apply when key missing in GGUF metadata
Resolve merge changes
* llama : rearrange model params
* Update new GET_KEY call
Add note that metadata KV overrides aren't reflected in initial metadata KV info dump
---------
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Samplers sequence order w parameter
* Cleaned commented code
* Fixed formatting
* Rewrote with unordered_map
* Revert and rewrite, too many problems and safeguards would be needed
* Fixed code style
* Code style fixes according to review
* More readable samplers input string, fixed help
* Style fix in sampler_queue
* Formatting fixes
* Fixing whitespaces
This commit updates the error message that is printed when the
KV cache is not big enough to hold all the prompt and generated
tokens. Specifically it removes the reference to n_parallel and
replaces it with n_len.
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
This commit adds a requirements file for the convert-hf-to-gguf.py
script, and also add the torch and transformers packages to it.
The motivation for this is that currently running convert-hf-to-gguf.py
will produce the following error:
```console
$ python3 -m venv venv
$ source venv/bin/activate
(venv) $ pip install -r requirements.txt
Collecting numpy==1.24.4
Collecting sentencepiece==0.1.98
Collecting gguf>=0.1.0
Installing collected packages: sentencepiece, numpy, gguf
Successfully installed gguf-0.5.1 numpy-1.24.4 sentencepiece-0.1.98
(venv) $ python convert-hf-to-gguf.py --help
Traceback (most recent call last):
File "llama.cpp/convert-hf-to-gguf.py", line 16, in <module>
import torch
ModuleNotFoundError: No module named 'torch'
```
With this commit, and using requirements-hf-to-gguf.txt instead of
requirements.txt, the script can be run and shows the help output.
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
* metal : implement soft_max_ext
* cuda : implement soft_max_ext
* ggml : implement soft_max_ext (CPU)
* batched-bench : print threads
ggml-ci
* metal : simplify soft_max encoding
ggml-ci
* cuda : use 512 threads for soft_max instead of 32
* ggml : update soft max cpu
* cuda : do warp-based block reduce
* cuda : increase max block size to 1024
* cuda : fix warp reduction initialization of shared mem
* metal : warp-based reduction for soft max kernel
* metal : warp-based reduce for rms_norm
* metal : simplify soft max kernel
ggml-ci
* alloc : fix build with debug
* * add multiprompt support
* * cleanup
* * more cleanup
* * remove atomicity of id_gen, and change lock_guard to unique_lock on completion requests
* * remove all references to mutex_multitasks
* Update examples/server/server.cpp
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
* Update examples/server/server.cpp
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
* Update examples/server/server.cpp
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
* Update examples/server/server.cpp
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
* * change to set
---------
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
* cmake : fix joining of REAL_GIT_DIR
* fix includes with help from include-what-you-use
* make : remove unneeded deps and add test-rope target
* fix C includes in C++ source files
* Revert "fix includes with help from include-what-you-use"
This reverts commit 635e9fadfd.
* ShareGPT4 compatibility (vision encoder only loading)
Load only a CLIP vision encoder (as supplied by ShareGPT finetunes)
Corrects the argument parsing for --img_mean and --img_std (which were previously not parsed but attempted to access)
Defines defaults for img_mean and img_std which are equal to the llava 1.5 CLIP encoder, so you do not have to provide them
* Update convert-image-encoder-to-gguf.py
* llama: fix alignment of general.name in print meta
This commit fixes the alignment of the general.name field in the
llm_load_print_meta function.
Currently the output looks like this:
```console
llm_load_print_meta: model ftype = mostly Q4_0
llm_load_print_meta: model params = 13.02 B
llm_load_print_meta: model size = 6.86 GiB (4.53 BPW)
llm_load_print_meta: general.name = LLaMA v2
```
And with this commit it looks like this:
```console
llm_load_print_meta: model ftype = mostly Q4_0
llm_load_print_meta: model params = 13.02 B
llm_load_print_meta: model size = 6.86 GiB (4.53 BPW)
llm_load_print_meta: general.name = LLaMA v2
```
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
* llama: fix alignment of special tokens
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
---------
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
Typical sampling was broken because after copying new_candidates into canditates, the "sorted" bool is left at "true", but the new data is no longer sorted according to probability. Patch to set "sorted" to false.
Test: Generating with temp=0.0001 (approx. argmax) should generate the same sequence at typical>=1.0 and typical=0.9999 (approx. disabled, but enters the typical sampling codepath).
* fix oai proxy
fix generation not stoped while bot stop talking in chat mode
fix possible `slot_id` not exist
response for cors (and pre flight)
* oai proxy: workaround for some client (such as Chatbox)
* use stop as separator to replace hardcoded `\n`
* ggml : use blas even if src0 is not F32
* llama : use n_threads_batch only when n_tokens >= 32
ggml-ci
* llama : revert n_threads_batch logic
ggml-ci
* copy to llama.cpp as subdir
* attempt enabling metal, fails
* ggml metal compiles!
* Update README.md
* initial conversion to new format, utf8 errors?
* bug fixes, but now has an invalid memory access :(
* added O3, now has insufficient memory access
* begin sync with master
* update to match latest code, new errors
* fixed it!
* fix for loop conditionals, increase result size
* fix current workflow errors
* attempt a llama.swiftui workflow
* Update .github/workflows/build.yml
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Add openai-compatible POST /v1/chat/completions API endpoint to server example
* fix code style
* Update server README.md
* Improve server README.md
* Fix server.cpp code style according to review
* server : some style changes
* server : indentation
* server : enable special tokens during tokenization by default
* server : minor code style
* server : change random string generator
* straightforward /v1/models endpoint
---------
Co-authored-by: kir-gadjello <111190790+kir-gadjello@users.noreply.github.com>
Co-authored-by: Tobi Lütke <tobi@Tobis-MacBook-Pro.local>
* llama : keep track of used KV cells + better KV cache management
* llama : zero KV cache used upon clear
ggml-ci
* llama : allow exporting a view of the KV cache (#4180)
* Allow exporting a view of the KV cache
* Allow dumping the sequences per cell in common
* Track max contiguous cells value and position as well
* Fix max contiguous empty cells index calculation
Make dump functions deal with lengths or sequences counts > 10 better
* Fix off by one error in dump_kv_cache_view
* Add doc comments for KV cache view functions
Eliminate cell sequence struct; use llama_seq_id directly
Minor cleanups
* common : add -dkvc arg for enabling kv cache dumps
---------
Co-authored-by: Kerfuffle <44031344+KerfuffleV2@users.noreply.github.com>
Disabled rules:
* E203 Whitespace before ':' - disabled because we often use 'C' Style where values are aligned
* E211 Whitespace before '(' (E211) - disabled because we often use 'C' Style where values are aligned
* E221 Multiple spaces before operator - disabled because we often use 'C' Style where values are aligned
* E225 Missing whitespace around operator - disabled because it's broken so often it seems like a standard
* E231 Missing whitespace after ',', ';', or ':' - disabled because we often use 'C' Style where values are aligned
* E241 Multiple spaces after ',' - disabled because we often use 'C' Style where values are aligned
* E251 Unexpected spaces around keyword / parameter equals - disabled because it's broken so often it seems like a standard
* E261 At least two spaces before inline comment - disabled because it's broken so often it seems like a standard
* E266 Too many leading '#' for block comment - sometimes used as "section" separator
* E501 Line too long - disabled because it's broken so often it seems like a standard
* E701 Multiple statements on one line (colon) - broken only in convert.py when defining abstract methods (we can use# noqa instead)
* E704 Multiple statements on one line - broken only in convert.py when defining abstract methods (we can use# noqa instead)
* Support special tokens and not adding BOS to prompt in speculative
* Adapt to new should_add_bos function
* Ensure tgt and dft have same add_bos setting
* ggml-cuda.cu: Clean up warnings when compiling with clang
* ggml-cuda.cu: Move static items into anonymous namespace
* ggml-cuda.cu: Fix use of namespace start macro
* Revert "ggml-cuda.cu: Fix use of namespace start macro"
This reverts commit 26c1149026.
* Revert "ggml-cuda.cu: Move static items into anonymous namespace"
This reverts commit e29757e0f7.
* build: support ppc64le build for make and CMake
* build: keep __POWER9_VECTOR__ ifdef and extend with __powerpc64__
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
- introduces help entry for the argument
- cuts '--gpu-layers' form in order to simplify usage and documentation.
Signed-off-by: Jiri Podivin <jpodivin@gmail.com>
Co-authored-by: Jiri Podivin <jpodivin@redhat.com>
* Remove logically superfluous assertions and order by dimension
* Use cblas_sgemm() to implement ggml_compute_forward_out_prod()
* Remove ggml_compute_forward_out_prod_use_blas(), fix compiling errors on cmake/zig, remove trailing whitespace
* Add openBLAS support for sgemm() in compute_forward_out_prod()
* finetune : zero the loraB initial vectors
Without this, the first iteration is starting out far from the base model, instead of exactly on it.
Zeroing loraB is what the paper recommends. loralib also zeroes at least one of the init vector pairs
(though it departs from the paper in using a different distribution for the other vector, in some cases).
* tabs to spaces
* Use ggml_set_zero instead of adding a new function
* gguf-py: gguf-dump: Respect --no-tensor flag in JSON mode.
* Respect add_bos_token GGUF metadata value
* gguf-py: Try to fix SpecialVocab giving up too easily for the Nth time
* add safetensors to convert.py help message
* Check for single-file safetensors model
* Update convert.py "model" option help message
* revert convert.py help message change
* gguf-py: Refactor and add file reading support
* Replay changes from #3871
Credit to @cebtenzzre for that pull
* Various type annotation fixes.
* sort imports with isort (again)
* Fix missing return statement in add_tensor
* style cleanup with flake8
* fix NamedTuple and Enum usage
* Fix an issue with state init in GGUFReader
Move examples to an examples/ directory
Clean up examples
Add an example of modifying keys in a GGUF file
Update documentation with info on examples
Try to support people importing gguf/gguf.py directly
* Damagage is not a word.
* Clean up gguf-py/examples/modify_gguf.py whitespace
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
* Update gguf-py/examples/modify_gguf.py formatting
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
* Update gguf-py/gguf/gguf_reader.py type hint
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
* Make examples executable, formatting changes
* Add more information to GGUFReader and examples comments
* Include a gguf Python package version bump
* Add convert-gguf-endian.py script
* cleanup
* gguf-py : bump minor version
* Reorganize scripts
* Make GGUFReader endian detection less arbitrary
* Add JSON dumping support to gguf-dump.py
Which I kind of regret now
* A few for gguf-dump.py cleanups
* Murder accidental tuple in gguf-py/scripts/gguf-dump.py
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
* cleanup
* constants : remove unneeded type annotations
* fix python 3.8 compat
* Set up gguf- scripts in pyproject.toml
* And include scripts/__init__.py, derp
* convert.py: We can't currently support Q8_0 on big endian.
* gguf-py: SpecialVocab: Always try available sources for special token ids
gguf-py: SpecialVocab: Try to load merges from merges.txt if not in tokenizer.json
gguf-py: SpecialVocab: Add 'add_bos_token' type bools to GGUF metadata
u
* cleanup
* Promote add_X_token to GGUF metadata for BOS and EOS
---------
Co-authored-by: Jared Van Bortel <jared@nomic.ai>
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
* Update server.cpp with min_p after it was introduced in https://github.com/ggerganov/llama.cpp/pull/3841
* Use spaces instead of tabs
* Update index.html.hpp after running deps.sh
* Fix test - fix line ending
* fix backward process of rope
rope backward process was broken after YaRN RoPE (#2268) implementation, due to missing changes in backward functions.
the code for the backward process is nearly identically to the forward process:
the only difference is the sign of the sin-values.
to avoid future regressions remove the near-duplicate backward functions and reuse the forward code:
for this a new function argument `bool forward` was added to `ggml_compute_forward_rope_f32` and `ggml_compute_forward_rope_f16`.
the sin-values will be negated when forward is false.
* fix finetune rope call to use correct default attn_factor of 1.0f
* remove unused `ggml_rope_xpos_back`
it is better to have only one `ggml_rope_back` function that accepts all rope parameters, so that `ggml_compute_backward` can propagate all parameters without having to switch between different rope_back variants.
* fix comments explaining the sinus sign in ggml_forward_rope
* add missing function arguments in declaration
* fix function argument type in declaration
llava-cli was loading models with default params and ignoring settings
from the cli. This switches to a generic function to load the params
from the cli options.
* wip llava python bindings compatibility
* add external llava API
* add base64 in-prompt image support
* wip refactor image loading
* refactor image load out of llava init
* cleanup
* further cleanup; move llava-cli into its own file and rename
* move base64.hpp into common/
* collapse clip and llava libraries
* move llava into its own subdir
* wip
* fix bug where base64 string was not removed from the prompt
* get libllava to output in the right place
* expose llava methods in libllama.dylib
* cleanup memory usage around clip_image_*
* cleanup and refactor *again*
* update headerdoc
* build with cmake, not tested (WIP)
* Editorconfig
* Editorconfig
* Build with make
* Build with make
* Fix cyclical depts on Windows
* attempt to fix build on Windows
* attempt to fix build on Windows
* Upd TODOs
* attempt to fix build on Windows+CUDA
* Revert changes in cmake
* Fix according to review comments
* Support building as a shared library
* address review comments
---------
Co-authored-by: M. Yusuf Sarıgöz <yusufsarigoz@gmail.com>
Co-authored-by: Jared Van Bortel <jared@nomic.ai>
* Add detection code for avx
* Only check hardware when option is ON
* Modify per code review sugguestions
* Build locally will detect CPU
* Fixes CMake style to use lowercase like everywhere else
* cleanup
* fix merge
* linux/gcc version for testing
* msvc combines avx2 and fma into /arch:AVX2 so check for both
* cleanup
* msvc only version
* style
* Update FindSIMD.cmake
---------
Co-authored-by: Howard Su <howard0su@gmail.com>
Co-authored-by: Jeremy Dunn <jeremydunn123@gmail.com>
* Revert "cuda : add ROCM aliases for CUDA pool stuff (#3918)"
This reverts commit 629f917cd6.
* Revert "cuda : use CUDA memory pool with async memory allocation/deallocation when available (#3903)"
This reverts commit d6069051de.
ggml-ci
* Using cuda memory pools for async alloc/dealloc.
* If cuda device doesnt support memory pool than use old implementation.
* Removed redundant cublasSetStream
---------
Co-authored-by: Oleksii Maryshchenko <omaryshchenko@dtis.com>
* cmake : fix build when .git does not exist
* cmake : simplify BUILD_INFO target
* cmake : add missing dependencies on BUILD_INFO
* build : link against build info instead of compiling against it
* zig : make build info a .cpp source instead of a header
Co-authored-by: Matheus C. França <matheus-catarino@hotmail.com>
* cmake : revert change to CMP0115
---------
Co-authored-by: Matheus C. França <matheus-catarino@hotmail.com>
* Add '-ngl' support to finetune.cpp
* Add fprintf in ggml_cuda_op_add
When I tried CUDA offloading during finetuning following the readme, I got an assert here.
This probably isn't an important case because inference later gives a warning saying you should use f16 or f32 instead when using lora
* Add 'finetune.sh', which currently fails when using GPU
"error: operator (): Finetuning on tensors with type 'f16' is not yet supported"
* tweak finetune.sh
* Suppress some warnings in ggml.c
* Add f16 implementation to ggml_compute_forward_add_f16_f32
* Add an f16 case to ggml_add_cast_impl and llama_build_lora_finetune_graphs
* finetune.sh: Edit comments
* Add "add_f16_f32_f32_cuda"
* Tweak an error message
* finetune.sh: Add an optional LLAMA_MODEL_DIR variable
* finetune.sh: Add an optional LLAMA_TRAINING_DIR variable
* train : minor
* tabs to spaces
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
* Introduce the new Min-P sampler by @kalomaze
The Min-P sampling method was designed as an alternative to Top-P, and aims to ensure a balance of quality and variety. The parameter *p* represents the minimum probability for a token to be considered, relative to the probability of the most likely token.
* Min-P enabled and set to 0.05 default
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
* Extend llama_kv_cache_seq_rm to allow matichng any sequence
* Replace llama_kv_cache_tokens_rm with llama_kv_cache_clear
Use llama_kv_cache_clear for cache clearing
Change calls to llama_kv_cache_tokens_rm that want to delete by position to use llama_kv_cache_seq_rm functionality
* Try cwd for ggml-metal if bundle lookup fails
When building with `-DBUILD_SHARED_LIBS=ON -DLLAMA_METAL=ON -DLLAMA_BUILD_SERVER=ON`,
`server` would fail to load `ggml-metal.metal` because `[bundle pathForResource:...]`
returns `nil`. In that case, fall back to `ggml-metal.metal` in the cwd instead of
passing `null` as a path.
Follows up on #1782
* Update ggml-metal.m
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Allow quantizing k-quants to fall back when tensor size incompatible
* quantizing: Add warning when tensors were incompatible with k-quants
Clean up k-quants state passing a bit
* cuda : prints wip
* cuda : new cublas gemm branch for multi-batch quantized src0
* cuda : add F32 sgemm branch
* cuda : fine-tune >= VOLTA params + use MMQ only for small batches
* cuda : remove duplicated cuBLAS GEMM code
* cuda : add CUDA_USE_TENSOR_CORES and GGML_CUDA_FORCE_MMQ macros
* build : add compile option to force use of MMQ kernels
* cmake : add helper for faster CUDA builds
* batched : add NGL arg
* ggml : skip nops in compute_forward
* cuda : minor indentation
* cuda : batched cuBLAS GEMMs for src0 F16 and src1 F32 (attention ops)
* Apply suggestions from code review
These changes plus:
```c++
#define cublasGemmBatchedEx hipblasGemmBatchedEx
```
are needed to compile with ROCM. I haven't done performance testing, but it seems to work.
I couldn't figure out how to propose a change for lines outside what the pull changed, also this is the first time trying to create a multi-part review so please forgive me if I mess something up.
* cuda : add ROCm / hipBLAS cublasGemmBatchedEx define
* cuda : add cublasGemmStridedBatchedEx for non-broadcasted cases
* cuda : reduce mallocs in cublasGemmBatchedEx branch
* cuda : add TODO for calling cublas from kernel + using mem pool
---------
Co-authored-by: Kerfuffle <44031344+KerfuffleV2@users.noreply.github.com>
We still have the heads up in `README.md` regarding `bpe` tokenizers and this patch is needed for
- a couple of tokenizer tests
- some more `special` and `non-special` added tokens handling (as far as I understand it)
* Update special token handling
* Add mpt
* added `llama_model_token_*` variants to all the `llama_token_*` functions.
* added `LLAMA_API`
* formatting
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* removed old `llama_token` functions
* changed 3 more functions to take in model
- `llama_token_get_text`
- `llama_token_get_score`
- `llama_token_get_type`
* added back docs
* fixed main.cpp
* changed token functions to use new model variants
* changed token functions to use new model variants
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* implementing parallel decoding in server example
* crash fixed
* save dev progress
* refactored sampling function
* completion endpoint working
* multiple client support
* grammar + no stream completion
* cached prompt support
* chat.mjs support cached prompt + some fixes
* server ui now support multiple clients
* unused change reverted
* fixed timings per slot
* add context swap
* add changes to README.md
* llava multimodal integration
* fixed tokens probs
* add multimodal input - alfa
* refactor code + remove unused comments + improved README.md
* fix compilation errors with llvm
* notify the user from server ui that multimodality is unavialable
* some ci fixes
* fix ci make build undefined ref errors
* fix long prompt than ctx proposed in #3639
* fixed premature end due stop word
* context shift fixed
* fix llava implementation
* sync README.md changes
* readme change
* update api like OpenAI
* multimodal support enabled by default
* fix make bui;d errors
* fix multiple clients
* fix zig build
* new sampling API
* latest changes of sampling API
* server : coding-style normalization
* server : coding-style normalization (part 2)
* server : remove beam-search functionality
* server : bug fix in ingest_images
n_tokens is incremented internally by llama_batch_add
* server : use refs + use llama_batch_clear()
* server : snake case
* server : minor sync
* added thread safe pipeline
* server : bach has to be allocated for n_parallel sequences
* server : no need for atomic int - already using mutex
* server : logs + minor code style
* server : fix multibyte handle in partial response (#3706)
* fix image load + view image in chat
* make : silence stb warnings
* clip : link to ggml, not to llama
* server : fix switch fallthrough
* server : fix crash in Debug on macOS (I have no idea why this fixes it!?)
* server : refactor ctx_sampling init + n_ctx + names
* server : bug fix for prompt caching
* Do not save/load image_data to localStorage
* editorconfig : new line in index.html
* server : completion requests remember slot_id
* Update readme to document multimodal in server
* server : minor style
* Update readme to document multimodal in server
* server : hide ctx_sampling->prev behind API (#3696)
* server : apply fix from #3722
* server : fix slot reuse
* server : add comment about changing slot_state to bool
---------
Co-authored-by: FSSRepo <go778sgt@gmail.com>
Co-authored-by: Damian Stewart <d@damianstewart.com>
Co-authored-by: Steward Garcia <57494570+FSSRepo@users.noreply.github.com>
Co-authored-by: Jhen-Jie Hong <iainst0409@gmail.com>
Co-authored-by: M. Yusuf Sarıgöz <yusufsarigoz@gmail.com>
* Add validation for special token ids to llama.cpp
Small optimization for llama_byte_to_token SPM mode
* Fix BPE newline check, only I could break something so simple
* Killll meeeeee
* Account for GGUF_KEY_KEY only setting when the key exists
* Minor code cleanups.
* Fix convert.py error msg when added tokens are out of range
* Make gguf SpecialVocab vocab size-aware
Update conversion scripts accordingly
* Avoid a string copy
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* infill tokens correction
* serverinfill tokens correction
* removing any leading whitespace from infill suffix and removing leeading space token from suffix when params.escape
* removing any leading whitespace from infill suffix and removing leeading space token from suffix when params.escape
* only rm when params.escape, rm space if possible which is added back or rm added space token
* only rm when params.escape, rm space if possible which is added back or rm added space token
* Revert "only rm when params.escape, rm space if possible which is added back or rm added space token"
This reverts commit 63ba0b621f.
* fix interactive prompt escaping and fix server infill leading space handling
* rm unnecessary bool check
* process escapes for neg prompt and interactive consec prompts
* removed unneccessary static string escape
* check whether platform is 390x if yes->do not import immintrin.h
* support s390x big endian
* support --bigendian option for s390x
1. verified with baichuan7b-chat with float 16 on s390x
2. verified with baichuan7b-chat
3. verified with chinese-alpaca-2-13b-f16
* update format based on editor-config checker result
* Update convert-baichuan-hf-to-gguf.py
* 1. check in ggml.c if endianess is not match
2. update GGUF version
3. change get_pack_prefix to property
4. update information log
* always use "GGUF" as beginng of GGUF file
* Compare "GGUF" with file header char by char
1. Set GGUF_MAGIC to "GGUF" string instead of int value
2. Compare "GGUF" char by char to ensure its byte order
3. Move bytes swap code from convert.py to gguf.py write_tensor_data
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* metal : implement dequantize_q5_0
* metal : block_q_n_dot_y for block_q5_0 (broken)
* metal : revert unnecessary change
* metal : implement dequantize_q5_1
* metal : block_q_n_dot_y for q5_1 (broken)
* metal : fix block_q_n_dot_y
* minor : spaces / formatting
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Added documentation of JSON return value of /completion endpoint
* Update examples/server/README.md
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Rewrite special token handling from #1931
* shorten param name, add st verification by type
* use offsets instead of copy by substr
* formatting, remove copying iterator on delete
* llama : normalize code-style
* swift fix
* print pfx/sfx if verb, main: split pfx input sfx
* dont add space when using special tokens
* minor : comment + spacing
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
This commit removes `n_threads` from the `llama_decode_internal`
functions doc comment as it does not exist anymore.
It looks like this parameter was removed in
Commit 16bc66d947 ("llama.cpp : split
llama_context_params into model and context params").
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
* WIP: start implementing LLaVA
* rm scratch buf for now, will revert after cleanup
* LLaVA image encoder is working. will combine with llama
* Add llava inference code, but it's buggy. debugging
* LLaVA is working e2e, needs to optimize memory allocation + cleanup
* Use ggml_allocr + rm unnecessary code
* fix: crlf -> lf
* fix: new line at EoF
* fix: trailing whitespace
* Add readme
* Update readme
* Some cleanup
* Are you happy editorconfig?
* rm unused batch image preprocessing
* rm unused import
* fix: rm designated initializers
* introduce pad-to-square mode for non-square images
* are you happy editorconfig?
* gitignore /llava
* Handle cases where image file does not exist
* add llava target to Makefile
* add support for 13b model variant
* Maybe seed is unlucky?
* Check if apples are compared to apples
* are you happy editorconfig?
* Use temperature = 0.1 by default
* command line: use gpt_params_parse()
* minor
* handle default n_predict
* fix typo
* llava : code formatting, rename files, fix compile warnings
* do not use Wno-cast-qual for MSVC
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Fix mirostat state when using multiple sequences
* Fix mirostat by completely refactoring sampling!
* Try to fix zig build.
* Export function to fetch/create default sampler states
Code formatting cleanups and add some comments
Silence a warning about id not being used when logging is disabled
* Apply some renaming suggestions.
Fix comments that were out of sync with the pull.
* Use more consistant naming convention for sampling contexts
* CUDA: added support for ggml_clamp (see also: https://github.com/ggerganov/ggml/issues/545)
* mpt : added an implementation based (mostly) on falcon integration, modified with deltas from ggml/examples/mpt
* mpt : protect against "clip_qkv": null in mpt-7b
* mpt : quick fix to avoid "Strange model" warning when quantizing MPT models
* mpt : addendum to changeset:84e30e8 - leave parameter clamp_kqv out from metadata rather than use 0.0 to indicate "no clamping" (more compliant with the current GGUF spec?)
* mpt : standardized all tensor names to follow GGUF spec
* mpt : addendum to changeset:1be89c40 - use "req" parameter of GGUF_GET_KEY macro instead of duplicate code
* mpt : fixed comment s/gptneox/mpt/
* mpt : remove tabs, trailing whitespace
* mpt : removed ne01 + n_past == ne00 assertion from alibi (cuda/f32) and rope_shift from build_mpt
* mpt : updated convert-mpt-hf-to-gguf.py to reflect changes made to convert-gptneox-hf-to-gguf.py in pr:3252
* comment out n_past instead of marking it unused
* mpt : removed hardcoded +178 from convert script in favor of utilizing hparams["vocab_size"]
* mpt : remove unused tokenizer_json in convert script
* ggml : remove obsolete n_past assert in ggml_alibi
* llama : print clam_kqv and max_alibi_bias hparams
---------
Co-authored-by: Cebtenzzre <cebtenzzre@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* infill tokens correction
* serverinfill tokens correction
* removing any leading whitespace from infill suffix and removing leeading space token from suffix when params.escape
* removing any leading whitespace from infill suffix and removing leeading space token from suffix when params.escape
* only rm when params.escape, rm space if possible which is added back or rm added space token
* only rm when params.escape, rm space if possible which is added back or rm added space token
* Revert "only rm when params.escape, rm space if possible which is added back or rm added space token"
This reverts commit 63ba0b621f.
* fix interactive prompt escaping and fix server infill leading space handling
* rm unnecessary bool check
* metal : improve decoding speed for batches of 2-16
* metal : rename kernels mul_mat_ to mul_mv_
* metal : indentations
* minor
* metal : print more GPU info + disable mul_mm for MTLGPUFamiliy < Apple7
* kv cache slot search improvements
* Use n_ctx in kv find slot for consistency
* Ensure kv cache head points to a valid slot in llama_decode internal
* Add some comments to prevent dumb people (like me) from getting confused.
Fix uploading tensor data to device, including 3D, 4D, and non-contiguous tensors.
Use correct offsets into data that is already in VRAM.
Correct handling of OpenCL events when multiple commands are queued.
* sync : ggml (conv 1d + 2d updates)
ggml-ci
* ggml : fix UB in q5_0 and q5_1 quantize code
ggml.c:1033:39: runtime error: left shift of 1 by 31 places cannot be represented in type 'int'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior
ggml.c:1081:39: runtime error: left shift of 1 by 31 places cannot be represented in type 'int'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior
ggml-ci
* tests : fix UB in test-quantize-perf
* Added RVV intrinsics support for Q8 quantize row and also improved the existing dot product function for risc-v.
The RVV intrinsics is added for the following quantize row functions
quantize_row_q8_0
quantize_row_q8_1
The following dot product functions have also been optimized by using LMUL = 1/2 instead of LMUL = 1
ggml_vec_dot_q4_0_q8_0
ggml_vec_dot_q4_1_q8_1
ggml_vec_dot_q5_0_q8_0
ggml_vec_dot_q5_1_q8_1
And vector initialization in Q5 by temporary array is also replaced by the vid intrinsics
Signed-off-by: Ahmad Tameem <ahmad.tameem@10xengineers.ai>
* Added RVV intrinsics support for k_quants
This adds RISC-V Vector intrinsics support for the following K_quants functions for both QKK = 256 and QKK = 64
ggml_vec_dot_q2_K_q8_K
ggml_vec_dot_q3_K_q8_K
ggml_vec_dot_q4_K_q8_K
ggml_vec_dot_q5_K_q8_K
ggml_vec_dot_q6_K_q8_K
Signed-off-by: Ahmad Tameem <ahmad.tameem@10xengineers.ai>
---------
Signed-off-by: Ahmad Tameem <ahmad.tameem@10xengineers.ai>
* fix LLAMA_NATIVE
* syntax
* alternate implementation
* my eyes must be getting bad...
* set cmake LLAMA_NATIVE=ON by default
* march=native doesn't work for ios/tvos, so disable for those targets. also see what happens if we use it on msvc
* revert 8283237 and only allow LLAMA_NATIVE on x86 like the Makefile
* remove -DLLAMA_MPI=ON
---------
Co-authored-by: netrunnereve <netrunnereve@users.noreply.github.com>
* Work on the BPE tokenizer
Tokenizer tests work for Falcon-7B
* Try to fix build problem
* Fix debug assertion failure
* Fix MSVC Unicode BOM problem
* Cleanup and an improvement
* Fix compiler warning
* Cleanup
* Test doesn't work over the full range of Unicodes
* Update .gitignore and Makefile
* Another Makefile rule
* Testing Aquila
* Moving byte decoding back to `token_to_piece` ...
... because everyone is using it.
* Guarding some unusable code pathes
* Streamlining code and adding some more assertions
Important change: I'm classifying added tokens as control tokens now for BPE.
* Adding a comment
* Adding another assertion
* Fixed vocabulary guarding assertions
* Fix PR for recent change
* Fix PR for recent change
* Fix for compiler warning
* Fix PR for recent change
* Fix PR for recent change
* Fix PR for recent change
* Fix for compiler warning
* Fixes for more compiler warnings
* Remove unused code
* Fix initialization of static maps
* Add scores and token types back, adapt gptneox
* Update llama.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update unicode.h
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update unicode.h
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Ported Starcoder and added some assertions
* Fix coding style
* Apply @jploski 's fix for missing tokens
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* vvhg-code-infill (#1)
* infill in separate example (#2)
* reverted changes to main and added infill example
* cleanup
* naming improvement
* make : add missing blank line
* fix missing semicolon
* brought infill up to current main code
* cleanup
---------
Co-authored-by: Cebtenzzre <cebtenzzre@gmail.com>
* ggml-cuda : perform cublas matrix multiplication of quantized types as fp16
* rename CC_TURING to CC_VOLTA
* disable fp16 mat mul completely with multi GPU
* llama.cpp : split llama_context_params into model and context params
ggml-ci
* fix metal build
* fix freq_base/scale default to model value
* llama-bench : keep the same model between tests when possible
* move n_threads to llama_context_params, add n_threads_batch
* fix mpi build
* remove kv_size(), cuda scratch fixes
* remove low-vram option
* add n_threads_batch to system info, refactor to get_system_info()
* add documentation about --threads-batch to the READMEs
* llama-bench fix
* main : fix rope freq/scale warning
* llama.cpp : add llama_get_model
common : add llama_tokenize from model
* remove duplicated ctx/model functions
ggml-ci
* cuda : print total VRAM used
* fix track_max_mem in forward_batch_wo_cache_flash_attn_train
* remove unnecessary Adam(W) optimizer tensors.
reduces optimizer memory overhead from 7*modelsize to 2*modelsize.
additionally allows to optimize models with more than 2^31 parameters by replacing int with int64_t.
bumps training checkpoint file version, but old checkpoints can still be read.
new version with less tensors is saved.
* add gradient clipping to AdamW
* Fix reset of unused g->nodes and g->grads to NULL
* implement gradient checkpointing for training
reduces memory overhead from O(n_layer) to O(sqrt(n_layer))
as explained in readme of https://github.com/cybertronai/gradient-checkpointing
* remove unused compute buffer 3
* add and use function ggml_build_backward_expand to avoid stack overflows with large maximum number of nodes
GGML_API void ggml_build_backward_expand(struct ggml_context * ctx, struct ggml_cgraph * gf, struct ggml_cgraph * gb, bool keep);
* change AdamW decay parameter to work like the torch AdamW decay parameter
It is now relative to Adam learning rate `alpha*sched`.
Before that it was relative to `sched` only.
`alpha` being the maximum learning rate and `sched` being a scaling parameter in [0..1]
* change default AdamW weight decay parameter used in training to 0.1 as used in nanoGPT
* change default AdamW weight decay parameter defined in ggml to 0.0, making Adam default instead of AdamW
btw: the default weight decay parameter for torch.optim.AdamW is 0.01
* bug fixes for cross entropy loss
ggml_cross_entropy_loss: sums where not correctly added in workload of each thread
ggml_cross_entropy_loss_back: simplify backward process, reducing numerical issues
guard usage of exp f16 lookup in cross entropy by #define GGML_CROSS_ENTROPY_EXP_FP16
cross entropy loss is only used once during training, but it is quite sensitive to numerical errors introduced by exp-f16-lookup.
so exp-f16-lookup for cross entropy loss is disabled by default, trading better gradients for very slightly worse runtime performance.
* fix test-grad0 for cross_entropy_loss
the second argument to cross_entropy_loss must sum up to 1 for each row
* fix test-grad0 for soft_max
dont use only sum as aggregation, because sum of softmax is always 1 -> finite differences should not work
instead use sum(log(soft_max()*(1-eps)+eps)); use eps to avoid log(0)
* improve finite differences of test-grad0 by using double instead of float
* change cross_entropy_loss to output average over all rows
this helps keeping the loss and gradients in a sane range
* improve gradient checkpointing
sqrt(n_layers) is only the best checkpoint step when mem size of checkpoints and mem size of layers are equal.
since layers require more memory than the single-tensor-checkpoint we use, the optimal values are compute different:
```
given: n, u, v
objective: minimize(a*u+b*v) where a*b=n, a>0, b>0
b=n/a
minimize(a*u+v*n/a)
diff(a*u+v*n/a, a) = u - (v*n/a)/a
diff(a*u+v*n/a, a) == 0
u - (v*n/a)/a == 0
u == v*n/(a*a)
u*a*a = v*n
a*a = v*n/u
a = sqrt(n*v/u)
```
this change results in more checkpoints, requiring less layers to store between checkpoints, overall improving memory usage.
* disable gradient checkpointing debug output
* llama : fix rope usage in train-text-from-scratch after ChatGLM change
* add more training parameters:
--enable-restart N Only for Adam optimizer. Enable restarts of cos-decay
--disable-restart N Only for Adam optimizer. Disable restarts of cos-decay
--opt-past N Number of optimization iterations to track for delta convergence test. Disabled when zero.
--opt-delta N Maximum delta for delta convergence test. Disabled when <= zero.
--opt-max-no-improvement N Maximum number of optimization iterations with no improvement. Disabled when <= zero.
--adam-epsf N AdamW epsilon for convergence test. Disabled when <= zero.
--adam-min-alpha N Adam minimum learning rate alpha, usually 0.1 * alpha
* replace memcpy with reshape operation so that the graph is not cut at the input
this makes it possible to store other values into the input tensor and then simply recompute the graph without rebuilding it
* remove unused function argument from get_example_targets_batch
* measure and print total training time
* add optimization callback to ggml_opt_resume_g
this callback is called before each iteration with custom data and pointer to learning schedule parameter (only used in Adam(W)).
can be used for dynamic learning schedule and setting input data for batches before each iteration
* use optimization callback in training
allows dynamic learning schedule and different batch data for each iteration without relying on low n_iter and high n_examples parameters
reduces runtime by avoiding restart of optimization function and improves training convergence by providing a different batch for each iteration
* add minimum number of tensor dimensions to apply weight decay (default 2)
this allows to not apply weight decay to bias parameters
* rename training parameter cos-decay-alpha to cos-decay-min and clarify that adam-min-alpha also applies to warmup
* fix increase of model.train_samples and model.train_tokens
now that each optimizer iteration gets its own batch we need to multiply by number of opt iterations
* change sampling parameters for prediction after training to defaults of common.h
and clarify what is context for prediction and what are generated tokens
* tighten abs error bounds for cross_entropy_loss in test-grad0
* add conditional compilation of using F16 exp in flash attention
uncomment `// #define GGML_FLASH_ATTN_EXP_FP16` to enable usage of f16 exp in flash attention
* tighten abs error bounds for flash_attn in test-grad0
* tighten abs error bounds for sqrt in test-grad0
* remove out-commented vectorized code of opt_adam
the vectorized code might be bit faster for low number of parameters, but it had a big memory usage overhead
* ggml : update ggml_rms_norm_back with configurable eps
* llama training : fix ggml_rms_norm_back calls to pass configurable eps
* remove trailing whitespace
* add train function using automatic gradient checkpointing backward pass and allocator
* in train function replace add_inplace by regular add
because using add_inplace seems to result in different gradients
* don't use allocate hash_map on context
because the context has no_alloc=True when using memory allocator resulting in NULL data pointers
* correctly clone reshape and permute operations by also cloning tensor->nb values
* fix variable name and add missing type cast
* terminate recursive tensor cloning when reaching tensor without src tensors
* correctly clone view tensors by setting data pointers
without this the checkpointing would only work when being used together with memory allocator
* fix variable names
* swap arguments to commutative ops to be the same as in `forward_batch_wo_cache_flash_attn`
* add input tensors as checkpoints
so that recursive tensor cloning of gradient checkpointing terminates on input tensors
* fix variable name and add missing boolean negation
* make sure some tensors are not reallocated by inserting new temporary nodes depending on them:
output and parameter gradient tensors need to be available at the end of the graph execution
parameter gradient tensors also need to be available before the graph execution because they are set to zero before each optimizer iteration
checkpoint tensors are allocated all together to reduce memory allocator fragmentation
afterwards, in addition to the temporary nodes, we also need to reset the temporary leafs
* fix ASSERT to work with zero layers
* add training options whether to use allocator and/or unified training function
* integrate unified training function which may use memory allocator
the unified training function also supports arguments whether to use flash attention and/or gradient checkpointing
* format name of cloned tensors with " (clone)" suffix
* set names for tensors in unified train function for easier debugging
* allocate graph on context using ggml_new_graph
* remove handwritten training functions
* remove unused training parameters "use_scratch" and "use_unified"
* remove trailing whitespace
* remove unused train params: mem_compute1_gb & mem_compute2_gb
mem_compute_gb is used for compute when automatic memory allocator is not enabled, otherwise it can be very small to only hold the tensor definitions
mem_compute0_gb is used for automatic memory allocator (as long as measurement of max required size is not implemented)
* remove unused forward_batch function
* add debug asserts in ggml_allocr_alloc to some common pitfalls when using this function directly
* only use ggml_allocr_alloc when tensor has NULL data and is no view
* fix test when to create temporary backward graph
temporary backward graph is only necessary when using checkpointing
* fix memory "leak" in optimizers
each iteration a new cplan with new memory for work data was allocated.
now cplan creation only happens at the start of optimization, with each iteration reusing the cplan and its work data.
* reverse order of for loop in ggml_build_backward_expand to save memory when using gradient checkpointing and allocator
with this loop order gradient checkpointing with allocator on 16 layer model saves 13% memory; 2 layer memory it saves 2% memory.
the computation results are the same
* add API functions to access llama model tensors
* add stub example for finetuning, based on train-text-from-scratch
* move and remove code
* add API functions to access remaining model parameters:
mult, head and rot
* first draft for LORA finetune training
* remove const model and layer arguments in API functions for accessing model tensors
* bug fixes to make finetune compile
automatic allocator does not work yet
* add debug prints for training memory improvements
* fix names of lora tensors
* avoid stack overflow resulting from big ggml_cgraph
replace stack allocation and ggml_build_forward by ggml_new_graph in combination with ggml_build_forward_expand
* replace llama API functions to get model tensors by one function to get model tensor by name
LLAMA_API struct ggml_tensor * llama_get_model_tensor(struct llama_model * model, const char * name);
* remove unused call to not existing llama_get_layer_from_model
* implement ggml_compute_forward_out_prod_q_f32
* remove trailing whitespace
* add lora finetune support on quantized base model tensors
* add ggml_add_cast API function
this function works like ggml_add, but accepts a data type for the resulting tensor.
only supported for quantized src0 input.
* use ggml_add_cast in finetuning
lora-applied weights will now have data type F32, which improves gradients when finetuning quantized base models
* bug fix: actually use result type passed to ggml_add_cast
* make sure base model tensors data cannot be used in viewable operations
memory allocator would try to make lora application inplace on base model tensors.
since those are memory mapped this will result in memory access violations
* fix bug in ggml_out_prod which resulted in wrong n_dims of result tensors
* avoid keeping in memory ALL of the gradients
The problem here stems from ggml_graph_reset. This function is called in the optimization function, before each graph computation, to reset the gradients to zero. This required a unique memory slot for each gradient: allocating memory from a previosly freed memory location might lead to non-zero input gradients.
During ggml_compute_backward the gradients are build stepwise by adding or substracting new values, starting from a OP_NONE tensor which needs to contain zero-values. This requires the graph reset.
To avoid this I now remember in ggml_build_backward_expand the original OP_NONE gradient tensors in a hash table, which is passed to ggml_compute_backward. There instead of using add (or sub or similar) I test whether the existing gradient to be changed is a zero-valued-tensor by looking up its existence in the hash table. When it is such a zero-tensor it will not be modified, but replaced by the value to be added, otherwise the regular add (not inplace, allocator will take care of this) will be used. This way none of those zero-tensor values will be necessary in the final backward graph and more importantly they won't need a unique memory slot, just to make them zero.
* remove trailing whitespace
* remove debug prints and function to compute tensor data hash
* improve optimization iteration prints
* adjust maximal values to support finetuning 3B models
* change default finetune params lora_r and lora_alpha to match the n_rank parameters of 4
* bug fix: make sure finetune input gradient is allocated at begin and kept until end
* remove unnecessary src tensor from ggml_get_rows_back
we don't need data of src[2] for computation, only to setup the correct output shape.
remove dependency on src[2], so that allocator can work more freely.
the computational graph is still completely determined, because the output shape is naturally included.
this is similar to how ggml_reshape does it.
* remove unnecessary src tensor from ggml_repeat & ggml_repeat_back
we don't need data of src[1] for computation, only to setup the correct output shape.
remove dependency on src[1], so that allocator can work more freely.
the computational graph is still completely determined, because the output shape is naturally included
* resolve todo
allocator will only make it inplace when they are of the same type
* mixing multiple LORA adapters is now possible
pass more than one '--lora FNAME' argument to apply more than one LORA.
use '--lora-scaled FNAME S' when you want to specify a user-defined scale for an adapter.
* add option to save finetune output every N iterations
* also save latest finetune output with ITERATION="LATEST" and print where files are saved
saving with LATEST makes it easier to resume training from the latest checkpoint
the string "LATEST" can be configured with command line option "--fn-latest STR"
* update checkpoint train stats before saving via "--save-every"
* add command line option `--rank-wo N` for rank of wo tensor
* update finetune README
* fix dump_non_result_info_yaml to output multiple lora adapters
* bug fix: replace GGML_TYPE_SIZE[t] by ggml_type_size(t)
* replace llama_n_mult by llama_n_ff
* finetune bug fixes to compile with merged in code from master
* remove prediction related code to reduce duplicated code with main
use main instead
* reduce large memory overhead in train-text-from-scratch
all gradients had to be pinned so that graph_reset works correctly.
this is no longer necessary with the changes to ggml_compute_backward introduced in this PR.
* add comment explaining why finetune checkpoints are allocated in one block
* make default value of float member a float literal
* handle rms_norm and rope parameters the same as in train-text-from-scratch
* remove unused code
* remove vocab related code as it is unnecessary
* add LLM_KV_TRAINING_TYPE to train-text-from-scratch checkpoints
so that they can be differentiated from lora finetune checkpoints
* add gguf constants and load/save functions from train-text-from-scratch
* add load & save lora finetune checkpoints via gguf
* add python script to convert old finetune checkpoint files to gguf
* remove old checkpoint save & load code
* remove code to print data checksums which was used to verify correctness of new gguf code
* omit tokenization when training is disabled, only save llama lora adapter
training can be disabled by passing '-n 0' to finetune
* remove trailing whitespace
* update README.md
* implement ggml_compute_forward_repeat_f16
* avoid stack overflow of large cgraphs in test-grad0
* add ggml API functions ggml_unravel_index, ggml_get_i32_nd and its analogs for set and for f32
ggml_get_i32_1d, ggml_set_i32_1d, ggml_get_f32_1d, ggml_set_f32_1d now support non-contiguous tensors.
in case of non-contiguous tensor, the 1d index is unraveled into a multi index using ggml_unravel_index to be passed to '_nd' function equivalent.
this fixes a bug in test-grad0 which happens due to ggml_build_backward not building purely contiguous tensors anymore
* increase test-grad0 context mem size to accommodate for bigger cgraph
* add sanity check to ggml_compute_backward, asserting the correct shape of gradients
* fix ggml_acc_or_set to return tensor of correct shape
* remove unused 'inplace' argument from ggml_compute_backward function
inplace operations to add gradients are no longer created by ggml_compute_backward
use allocator to automatically make inplace operations
* add missing argument 'int i0' to ggml_get_i32_nd & ggml_set_i32_nd header declarations
* fix error message in ggml_allocr_alloc to display actual max_avail
* fix check_gradient
ggml_build_backward_expand was previously replaced by ggml_build_backward, but the assignment of forward graph to backward graph missing
* use tensor->view_src instead of ggml_is_view and get_view_source
* move gradient checkpointing code into ggml, new API function:
// build gradient checkpointing backward graph gb for gf using provided checkpoints
// gb_tmp will contain original backward graph with rewritten backward process nodes,
// but without the second forward pass nodes.
GGML_API void ggml_build_backward_gradient_checkpointing(
struct ggml_context * ctx,
struct ggml_cgraph * gf,
struct ggml_cgraph * gb,
struct ggml_cgraph * gb_tmp,
struct ggml_tensor * * checkpoints,
int n_checkpoints);
* replace custom data getters and setters by ggml functions
* train-text-from-scratch can train (full finetune) gguf models
just pass the gguf model via `--checkpoint-in FN`.
after this, to continue training, pass the generated checkpoint instead of the original gguf model.
tested with smaller models, bigger models may exceed available memory.
use (LORA) finetune for those.
* remove trailing whitespace
* add option to save train-text-from-scratch output every N iterations
* update README.md
* fix warnings
* fix warnings
* remove finetune option to disable allocator
the allocator should always be used.
by making sure that it is always used it gets easier to implement automatic memory requirements computation
* add tensor checkpoints only when gradient checkpointing is enabled
* initialize opt ggml context if none was provided
* add ggml-alloc API function 'ggml_allocr_max_size' to get max size of alloc
GGML_API size_t ggml_allocr_max_size(struct ggml_allocr * alloc);
* finetune: automatically allocate all memory and changes to command line options
remove '--n_examples N' parameter, as it no longer makes sense to call optimization process multiple times in a loop.
add '--only_write_lora' command line option: will skip tokenization and training, to only write a llama.cpp comptabile LORA adapter.
remove memory buffer related command line options.
improve iteration console output.
* add finetune to Makefile
* update README.md
* print time per iteration and estimate remaining time
* increase measured alloc size by tensor_alignment
ggml_allocr_reset will reduce the given size by up to tensor_alignment-1
* fix README.md
* add some more allocator debug prints
* bug fix, probably solves the 'ggml_allocr_alloc: not enough space in the buffer' issue
* revert last commit
"bug fix, probably solves the 'ggml_allocr_alloc: not enough space in the buffer' issue"
"alloc was freeing an externally allocated tensor, because it calculated the end of allocator memory as alloc->data + alloc->max_size instead of alloc->data + alloc->size."
This is intentional to reduce the risk of freeing external tensors when measuring. Unless max_size is not properly calculated, I don't see why this is an issue.
* remove unnecessary "0x" before "%p" output
* move measurement memory segment to upper region of the address space
* update README.md
* fix printf format warnings
* add missing gguf_free in load_checkpoint_lora_file
* load default rms_norm and rope parameters from base model
* add gradient accumulation
specify number accumulation steps with '--grad-acc N'.
this will simulate a bigger batch size of grad_acc*batch.
* fix tracking of train_samples and train_tokens
* build : fix compile warnings
* ggml : fix L-BFGS linesearch loop
* improve finetune time measurement
fix printf warnings on system where int64_t is (long int).
change time datatypes to double because values get big with long training times.
exclude file saving from time measurement.
converge faster to actual time per iteration by removing very small first duration before first iteration was performed.
fix bug in output of total training time, the reported value was 1000 times to small.
* specify default lora rank with '--lora-r N'
'--lora-r N' will specify default rank for all tensors
'--rank-wq N', etc. will override this default rank for specific tensor types.
* fix gradient accumulation bug where the same batch was used for each microstep
* fix gradient accumulation bug where the same batch was used for each microstep
* support grouped-query-attention in ggml_flash_attn and ggml_flash_attn_back
k and v can now be repeated in q along ne[2]
in forward pass just use modulo to compute k and v indices, like ik2 = iq2 % nek2.
in backard pass this won't work as easy, because multiple threads will compete to accumulate to the same k->grad[:,ik1,ik2,ik3] and v->grad[:,iv1,iv2,iv3].
so we change the parallelization over q rows to be over k rows. this ensures non-overlapping (ik2,ik3) across threads.
in each thread we then iterate over the number of repetitions of k/v in q to compute iq2 as iq2 = ik2 + irep*nek2.
since ne2 is not the same for q,k and v we also change how the gradients are concatenated into the result tensor.
additionally the offsets of gradq, gradk and gradv in the result tensor are now memory aligned.
we also simplify the compute_backward part of flash_attn to use ggml_reshape instead of switching over the number of dimensions.
this needs a small change to ggml_reshape, removing the assertion of second argument to be contiguous.
since only the shape (ne) of the second reshape argument is of relevance, its memory layout (nb) is irrelevant -> it can very well be non-contiguous.
change test-grad0 to also test for repeated k/v in q.
this changes the rng and now results in small gradient differences in softmax. these solely come from using f16 exp table lookup in forward softmax: when temporarily changing softmax to use actual exp function, the reported gradient differences go away. gradient differences coming solely from f16 table lookup are acceptable.
added a note to explain this.
* add llama API functions to get grouped-query-attention n_head parameter 'n_head_kv'.
* fix finetune to support grouped-query-attention (using flash-attention)
note: ggml changes to ggml_out_prod are necessary to support grouped-query-attention without flash-attention.
* support broadcastable a in out_prod(a, b) and backward pass of broadcasting mul_mat(a, b)
* test broadcasting mul_mat backward pass
* decouple random number generator of each operation test
when changing one test the rng of others tests is not influenced anymore
* add comment briefly describing what ggml_repeat_back does
* simplify broadcasting mul_mat backward using ggml_repeat_back
* add cgraph evaluation order member and corresponding enum type
this controls in which order ggml_build_forward visits source nodes.
by default the nodes are visited left to right, i.e. src[0] first.
in some cases it is beneficial for ggml-alloc to visit in a different order.
two possible orders are supported: left-to-right (src[0] first) and right-to-left (src[0] last).
* measure max compute size for each cgraph eval order and use best order
this can bring huge memory savings:
e.g. codellama-34b with n_ctx=64, n_batch=1 goes from 92927.8mb down to 4627.6 MB
* remove unused command line options
* add sample start patterns and options to force new or by default resume last shuffling
* update shuffle rng state on reshuffle
* exclude known zero values from computations in flash_attn_f32 & flash_attn_back_f32
* remove probably unnecessary exception type flags from stringstream
* pass correct max number of tokens to llama_tokenize
* account for possible leading whitespace that will be added by tokenizer
e.g. '\t' will be tokenized by llama spm tokenizer to [29871, 12]
* use unrolled vec_mad in out_prod
y is vec_mad result vec.
x is vec_mad input vec.
v is vec_mad input scalar.
ggml_vec_mad_f32_unroll will internally loop over x and v with same y.
GGML_VEC_MAD_UNROLL is by default defined to 32.
This value is empirical optimized using performance test runs of out-prod in openllama-3b finetune with 256 context length and batch size 1. It gives 23% performance boost for out_prod.
Full measurements of out-prod runtime in ms:
unroll_xv unroll_yv
1 67014.643 87826.469
2 77117.552 89077.656
4 72091.311 109121.657
8 61077.543 88678.334
16 56914.67 79514.947
24 59024.595 84350.254
28 55952.446 83368.73
32 51476.658 85177.745
36 55973.792 84659.92
40 55139.616 93844.738
48 60736.392 93330.267
64 99856.878 116994.99
Second column is when unrollying yv instead of xv
* set lora_alpha to value of lora_r if it is not set via command line
otherwise only changing lora_r will change scaling of lora adapter used in prediction
* reshuffle original sample order instead of the previous shuffled order
otherwise resumed reshuffle will not result in same sample order
* block tiling for out-prod inspired by mul-mat
block sizes are empirically optimized
roughly doubles the flops of out-prod
* exclude some more known zero values from computations in flash_attn_f32 & flash_attn_back_f32
* add static keywords
* remove outcommented old code
* update train-text-from-scratch with tokenization, sample selection and shuffling from finetune
* remove lbfgs related train parameters
* move common train functions into common/train.[h|cpp]
* move train state into struct train_state
* move train data saving code into callback to unify code of opt_callback
train_params are still different in finetune and train-text-from-scratch, so it can't yet be moved to train.h|cpp
* move common train params into common/train
* move common opt_callback into common/train
* fix consume_common_train_arg
* save and load head_count_kv in lora checkpoints
* increase train_samples by used_samples instead of number of batches
on batch can contain more than one sample when option "fill_with_next_samples" is used
* fix usage of llama_tokenize
* remove static from process_escape since we need it exposed in header
* fix code formating of long function declarations
* fix condition in load_train_state_gguf
* use die("msg") instead of replace GGML_ASSERT(!"msg") or throw std::runtime_error("msg")
* fix saving and loading of training type
* remove terminating '\0' from tokenization
(llama_tokenize is now passed the string length instead of relying on terminating '\0')
* fix compile warnings
* fix compile warnings
* use new/delete for train_state instead of malloc/free
using malloc may result in seg faults when trying to assign string fields
* assert that sample_count > 0, avoiding division by zero
* fix frand to return value in interval [0,1)
* add train option "--sample-random-offsets"
Use samples beginning at random offsets.
The offset is only applied to the first sample in each batch context window.
Together with "--fill-with-next-samples" this may help for training endless text generation.
For example given a dataset containing samples "abcd", "ABCD", "0123".
With context size of 8 and options "--fill-with-next-samples", "--no-separate-with-eos", "--no-separate-with-bos",
the context windows of batches could only be filled with "abcdABCD", "ABCDabcd", "0123abcd", etc.
With "--sample-random-offsets" it can also be filled with "23abcdAB", "bcd0123A", etc.
* deduplicate code into function
* remove n_rot hparam, as it must always be hparam.n_embd_head()
* align code
* assert correct base model tensor shapes
* move some params from lora hparams into model hparams and load model params from gguf
this equalizes the model definition in finetune and text-from-scratch and removes the need for additional llama api functions to get model parameters
* remove now unnecessary llama API functions to get model params that where added by this PR
* train-text-from-scratch: automatically allocate model tensors, remove option '--mem-model N'
* train-text-from-scratch: automatically allocate opt context
* train-text-from-scratch: automatically allocate input tensors
* train-text-from-scratch: automatically allocate compute memory
* remove unused options and equalize train-text-from-scratch with finetune
* initialize opt->loss_after with zero
* add export-lora program
* remove trailing whitespace
* add export-lora build in Makefile
* remove unused struct tensor_info from export-lora
* add export-lora build dependency to llama
because it depends on common, which depends on llama
* update finetune README.md
* cancel optimization when specified number of epochs is completed
* improve handling of export-lora arguments
print errors and warnings when files could not be read or created
* Fix export-lora.cpp "not enough space in the context's memory pool" (#1)
* Fix export-lora.cpp "not enough space in the context's memory pool"
Without this patch, export-lora would sometimes error with "not enough space in the context's memory pool (needed 656784, available 656800)".
* increase required context size by 5*GGML_MEM_ALIGN instead of plain 16
---------
Co-authored-by: xaedes <xaedes@gmail.com>
* improve handling of not yet supported tensor types
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: meatbag-18a <145869052+meatbag-18a@users.noreply.github.com>
* Resync my fork with new llama.cpp commits
* examples : rename to use dash instead of underscore
* New model conversions
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Fix für #2721
* Reenable tokenizer test for LLaMa
* Add `console.cpp` dependency
* Fix dependency to `common`
* Fixing wrong fix.
* Make console usage platform specific
Work on compiler warnings.
* Adapting makefile
* Remove trailing whitespace
* Adapting the other parts of the makefile
* Fix typo.
* Fixing the last deviations from sentencepiece indicated by test-tokenizer-1
* Simplify logic
* Add missing change...
* Fix ugly compiler warning
* llama_tokenize should accept strings containing NUL now
* Adding huichen's test case
* add placeholder of starcoder in gguf / llama.cpp
* support convert starcoder weights to gguf
* convert MQA to MHA
* fix ffn_down name
* add LLM_ARCH_STARCODER to llama.cpp
* set head_count_kv = 1
* load starcoder weight
* add max_position_embeddings
* set n_positions to max_positioin_embeddings
* properly load all starcoder params
* fix head count kv
* fix comments
* fix vram calculation for starcoder
* store mqa directly
* add input embeddings handling
* add TBD
* working in cpu, metal buggy
* cleanup useless code
* metal : fix out-of-bounds access in soft_max kernels
* llama : make starcoder graph build more consistent with others
* refactor: cleanup comments a bit
* add other starcoder models: 3B, 7B, 15B
* support-mqa-directly
* fix: remove max_position_embeddings, use n_train_ctx
* Update llama.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update llama.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Apply suggestions from code review
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* fix: switch to space from tab
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* metal : relax conditions on fast matrix multiplication kernel
* metal : revert the concurrnecy change because it was wrong
* llama : remove experimental stuff
* add freebsd to ci
* bump actions/checkout to v3
* bump cuda 12.1.0 -> 12.2.0
* bump Jimver/cuda-toolkit version
* unify and simplify "Copy and pack Cuda runtime"
* install only necessary cuda sub packages
* Keep static libs and headers with install
* Add logic to generate Config package
* Use proper build info
* Add llama as import library
* Prefix target with package name
* Add example project using CMake package
* Update README
* Update README
* Remove trailing whitespace
Enables the GPU enabled container images to be built and pushed
alongside the CPU containers.
Co-authored-by: canardleteer <eris.has.a.dad+github@gmail.com>
* Fix für #2721
* Reenable tokenizer test for LLaMa
* Add `console.cpp` dependency
* Fix dependency to `common`
* Fixing wrong fix.
* Make console usage platform specific
Work on compiler warnings.
* Adapting makefile
* Remove trailing whitespace
* Adapting the other parts of the makefile
* Fix typo.
* Minor speed gains for all quantization types
* metal: faster kernel_scale via float4
* Various other speedups for "small" kernels
* metal: faster soft_max vial float4
* metal: faster diagonal infinity
Although, to me it looks like one should simply
fuse scale + diagnonal infinity + soft_max on the
KQtensor.
* Another faster f16 x f32 matrix multiply kernel
* Reverting the diag infinity change
It does work for PP, but somehow it fails for TG.
Need to look more into it.
* metal: add back faster diagonal infinity
This time more carefully
* metal : minor (readibility)
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Metal support for Swift
* update
* add a toggle for arm/arm64
* set minimum versions for all platforms
* update to use newLibraryWithURL
* bump version
Co-authored-by: Jhen-Jie Hong <iainst0409@gmail.com>
---------
Co-authored-by: Jhen-Jie Hong <iainst0409@gmail.com>
* Slightly faster Q3_K and Q5_K on metal
* Another Q3_K speedup on metal
Combined with previous commit, we are now +9.6% for TG.
PP is not affected as this happens via the matrix multiplication
templates.
* Slowly progressing on Q3_K on metal
We are now 13% faster than master
* nother small improvement for Q3_K on metal
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Do not use _GNU_SOURCE gratuitously.
What is needed to build llama.cpp and examples is availability of
stuff defined in The Open Group Base Specifications Issue 6
(https://pubs.opengroup.org/onlinepubs/009695399/) known also as
Single Unix Specification v3 (SUSv3) or POSIX.1-2001 + XSI extensions,
plus some stuff from BSD that is not specified in POSIX.1.
Well, that was true until NUMA support was added recently,
so enable GNU libc extensions for Linux builds to cover that.
Not having feature test macros in source code gives greater flexibility
to those wanting to reuse it in 3rd party app, as they can build it with
FTMs set by Makefile here or other FTMs depending on their needs.
It builds without issues in Alpine (musl libc), Ubuntu (glibc), MSYS2.
* make : enable Darwin extensions for macOS to expose RLIMIT_MEMLOCK
* make : enable BSD extensions for DragonFlyBSD to expose RLIMIT_MEMLOCK
* make : use BSD-specific FTMs to enable alloca on BSDs
* make : fix OpenBSD build by exposing newer POSIX definitions
* cmake : follow recent FTM improvements from Makefile
* metal : fix kernel_norm
ggml-ci
* metal : put warning in kernel_norm to not combine the loops
* metal : restore original F16 mat-vec multiplication
It works after the norm fixes
* common : don't do warm-up with more than n_batch tokens (close#3058)
ggml-ci
* metal : minor
* llama : use posix_madvise() instead of madvise() derived from BSD
sed -i 's,\<madvise\>,posix_&,g;s,\<MADV_,POSIX_&,g' llama.cpp
* ggml : use sysconf(_SC_PAGESIZE) instead of getpagesize() derived from BSD
sed -i 's,getpagesize(),sysconf(_SC_PAGESIZE),g' ggml.c
* metal : use sysconf(_SC_PAGESIZE) instead of getpagesize() derived from BSD
sed -i 's,getpagesize(),sysconf(_SC_PAGESIZE),g' ggml-metal.m
* convert-llama-ggmlv3-to-gguf: Try to handle files older than GGJTv3
* Better error messages for files that cannot be converted
* Add file type to GGUF output
* Rename to convert-llama-ggml-to-gguf.py
* Include original file type information in description
* Improve some informational output
* Guard against all weights in a super-block being zero
* Also guard against extremely small weights
Closes#2982
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* build : on Mac OS enable Metal by default
* make : try to fix build on Linux
* make : move targets back to the top
* make : fix target clean
* llama : enable GPU inference by default with Metal
* llama : fix vocab_only logic when GPU is enabled
* common : better `n_gpu_layers` assignment
* readme : update Metal instructions
* make : fix merge conflict remnants
* gitignore : metal
* ggml-alloc : use virtual memory for measurement
* compatibility fixes for MAP_ANONYMOUS
* fallback to fixed address for systems without virtual memory
* Very minor speedup via simd-group synchronization in f16 x f32
* Another very minor speedup on metal
* Quite significant PP speedup on metal
* Another attempt
* Minor
* Massive improvement for TG for fp16
* ~4-5% improvement for Q8_0 TG on metal
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* make : remove unused -DGGML_BIG_ENDIAN
* make : put preprocessor stuff in CPPFLAGS
* make : pass Raspberry Pi arch flags to g++ as well
* make : support overriding CFLAGS/CXXFLAGS/CPPFLAGS/LDFLAGS
* make : fix inverted conditional
* ggml_metal_init: Show all Metal device instances in the system
Also show the default Metal device that was picked.
* Update ggml-metal.m
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* k-quants : fix build on armv7
* ggml : cleanup unused arm32 specific impl
* k-quants : avoid some unused vzero / mzero define
* ggml-alloc : use 4g for MEASURE_MAX_SIZE in 32-bit arm
* Allow quantize tool to only copy tensors to allow repackaging models.
* Slightly better logic when requantizing.
* Change help message to go to `stdout`.
* llama2c : fix segfault if vocab is not found
* llama2c : fix mismatch between new[] and delete
* llama2c : fix basename on Windows
* llama2c : use a destructor to prevent memory leaks
* Somewhat faster f16 x f32 matrix multiply kernel
* Better use 32 thread groups for f16 x f32
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* convert : fix python 3.8 support
* convert : sort imports
* convert : fix required parameters in convert-llama-ggmlv3-to-gguf
* convert : fix mypy errors in convert-llama-ggmlv3-to-gguf
* convert : use PEP 585 generics and PEP 604 unions
Now that we have `from __future__ import annotations`, we can use this
modern syntax in Python 3.7 instead of restricting support to Python 3.9
or 3.10 respectively.
* gguf.py : a tuple is already a tuple
* add mypy.ini
* convert : add necessary `type: ignore` comments
* gguf-py: bump version
* build ci: run make test
* makefile:
- add all
- add test
* enable tests/test-tokenizer-0-llama
* fix path to model
* remove gcc-8 from macos build test
* Update Makefile
* Update Makefile
* convert: Fix permute calls and method/func definitions
* Cleanups for gguf-py
* Minor types cleanups.
* Initial implementation of handling merges and special tokens
* convert: Handle special tokens and merges in vocab only mode
convert: Vocab only mode no longer requires loading model tensors
* gguf: Refactor tensor name mapping
* convert: Fix type hint for special_token_types in SpecialVocab
* Use common special vocab handling in various conversion scripts
* First pass at implementing suggested changes
* Second pass
* gguf: SpecialVocab: Fix issue with special token content not in a dict
gguf: SpecialVocab: Allow skipping handling of merges
* convert-falcon-hf-to-gguf: Support --vocab-only option, bail out if no tokenizer.json
* convert-gptneox-hf-to-gguf and convert: Only handle merges for BPE tokenizer
* gguf: SpecialVocab: Actually set load_merges in object
* Uniform args parsing and vocab only mode for convert examples
* convert.py: Set gpt2 as tokenizer model when using BPE
* Squish last type warning in gguf.py - yay!
* tests : add a C compliance test
* make : build C compliance test by default
* make : fix clean and make sure C test fails on clang
* make : move -Werror=implicit-int to CFLAGS
* ggml : add view_src and view_offs
* update ggml-alloc to use view_src
* update ggml_diag_mask to work correctly with automatic inplace
* exclude other ops that set an inplace flag from automatic inplace
* make : do not pass headers to the compiler
This fixes building tests with clang.
* make : add missing examples
* make : fix build-info.h dependencies
* fix track_max_mem in forward_batch_wo_cache_flash_attn_train
* remove unnecessary Adam(W) optimizer tensors.
reduces optimizer memory overhead from 7*modelsize to 2*modelsize.
additionally allows to optimize models with more than 2^31 parameters by replacing int with int64_t.
bumps training checkpoint file version, but old checkpoints can still be read.
new version with less tensors is saved.
* add gradient clipping to AdamW
* Fix reset of unused g->nodes and g->grads to NULL
* implement gradient checkpointing for training
reduces memory overhead from O(n_layer) to O(sqrt(n_layer))
as explained in readme of https://github.com/cybertronai/gradient-checkpointing
* remove unused compute buffer 3
* add and use function ggml_build_backward_expand to avoid stack overflows with large maximum number of nodes
GGML_API void ggml_build_backward_expand(struct ggml_context * ctx, struct ggml_cgraph * gf, struct ggml_cgraph * gb, bool keep);
* change AdamW decay parameter to work like the torch AdamW decay parameter
It is now relative to Adam learning rate `alpha*sched`.
Before that it was relative to `sched` only.
`alpha` being the maximum learning rate and `sched` being a scaling parameter in [0..1]
* change default AdamW weight decay parameter used in training to 0.1 as used in nanoGPT
* change default AdamW weight decay parameter defined in ggml to 0.0, making Adam default instead of AdamW
btw: the default weight decay parameter for torch.optim.AdamW is 0.01
* bug fixes for cross entropy loss
ggml_cross_entropy_loss: sums where not correctly added in workload of each thread
ggml_cross_entropy_loss_back: simplify backward process, reducing numerical issues
guard usage of exp f16 lookup in cross entropy by #define GGML_CROSS_ENTROPY_EXP_FP16
cross entropy loss is only used once during training, but it is quite sensitive to numerical errors introduced by exp-f16-lookup.
so exp-f16-lookup for cross entropy loss is disabled by default, trading better gradients for very slightly worse runtime performance.
* fix test-grad0 for cross_entropy_loss
the second argument to cross_entropy_loss must sum up to 1 for each row
* fix test-grad0 for soft_max
dont use only sum as aggregation, because sum of softmax is always 1 -> finite differences should not work
instead use sum(log(soft_max()*(1-eps)+eps)); use eps to avoid log(0)
* improve finite differences of test-grad0 by using double instead of float
* change cross_entropy_loss to output average over all rows
this helps keeping the loss and gradients in a sane range
* improve gradient checkpointing
sqrt(n_layers) is only the best checkpoint step when mem size of checkpoints and mem size of layers are equal.
since layers require more memory than the single-tensor-checkpoint we use, the optimal values are compute different:
```
given: n, u, v
objective: minimize(a*u+b*v) where a*b=n, a>0, b>0
b=n/a
minimize(a*u+v*n/a)
diff(a*u+v*n/a, a) = u - (v*n/a)/a
diff(a*u+v*n/a, a) == 0
u - (v*n/a)/a == 0
u == v*n/(a*a)
u*a*a = v*n
a*a = v*n/u
a = sqrt(n*v/u)
```
this change results in more checkpoints, requiring less layers to store between checkpoints, overall improving memory usage.
* disable gradient checkpointing debug output
* llama : fix rope usage in train-text-from-scratch after ChatGLM change
* add more training parameters:
--enable-restart N Only for Adam optimizer. Enable restarts of cos-decay
--disable-restart N Only for Adam optimizer. Disable restarts of cos-decay
--opt-past N Number of optimization iterations to track for delta convergence test. Disabled when zero.
--opt-delta N Maximum delta for delta convergence test. Disabled when <= zero.
--opt-max-no-improvement N Maximum number of optimization iterations with no improvement. Disabled when <= zero.
--adam-epsf N AdamW epsilon for convergence test. Disabled when <= zero.
--adam-min-alpha N Adam minimum learning rate alpha, usually 0.1 * alpha
* replace memcpy with reshape operation so that the graph is not cut at the input
this makes it possible to store other values into the input tensor and then simply recompute the graph without rebuilding it
* remove unused function argument from get_example_targets_batch
* measure and print total training time
* add optimization callback to ggml_opt_resume_g
this callback is called before each iteration with custom data and pointer to learning schedule parameter (only used in Adam(W)).
can be used for dynamic learning schedule and setting input data for batches before each iteration
* use optimization callback in training
allows dynamic learning schedule and different batch data for each iteration without relying on low n_iter and high n_examples parameters
reduces runtime by avoiding restart of optimization function and improves training convergence by providing a different batch for each iteration
* add minimum number of tensor dimensions to apply weight decay (default 2)
this allows to not apply weight decay to bias parameters
* rename training parameter cos-decay-alpha to cos-decay-min and clarify that adam-min-alpha also applies to warmup
* fix increase of model.train_samples and model.train_tokens
now that each optimizer iteration gets its own batch we need to multiply by number of opt iterations
* change sampling parameters for prediction after training to defaults of common.h
and clarify what is context for prediction and what are generated tokens
* tighten abs error bounds for cross_entropy_loss in test-grad0
* add conditional compilation of using F16 exp in flash attention
uncomment `// #define GGML_FLASH_ATTN_EXP_FP16` to enable usage of f16 exp in flash attention
* tighten abs error bounds for flash_attn in test-grad0
* tighten abs error bounds for sqrt in test-grad0
* remove out-commented vectorized code of opt_adam
the vectorized code might be bit faster for low number of parameters, but it had a big memory usage overhead
* ggml : update ggml_rms_norm_back with configurable eps
* llama training : fix ggml_rms_norm_back calls to pass configurable eps
* remove trailing whitespace
* add train function using automatic gradient checkpointing backward pass and allocator
* in train function replace add_inplace by regular add
because using add_inplace seems to result in different gradients
* don't use allocate hash_map on context
because the context has no_alloc=True when using memory allocator resulting in NULL data pointers
* correctly clone reshape and permute operations by also cloning tensor->nb values
* fix variable name and add missing type cast
* terminate recursive tensor cloning when reaching tensor without src tensors
* correctly clone view tensors by setting data pointers
without this the checkpointing would only work when being used together with memory allocator
* fix variable names
* swap arguments to commutative ops to be the same as in `forward_batch_wo_cache_flash_attn`
* add input tensors as checkpoints
so that recursive tensor cloning of gradient checkpointing terminates on input tensors
* fix variable name and add missing boolean negation
* make sure some tensors are not reallocated by inserting new temporary nodes depending on them:
output and parameter gradient tensors need to be available at the end of the graph execution
parameter gradient tensors also need to be available before the graph execution because they are set to zero before each optimizer iteration
checkpoint tensors are allocated all together to reduce memory allocator fragmentation
afterwards, in addition to the temporary nodes, we also need to reset the temporary leafs
* fix ASSERT to work with zero layers
* add training options whether to use allocator and/or unified training function
* integrate unified training function which may use memory allocator
the unified training function also supports arguments whether to use flash attention and/or gradient checkpointing
* format name of cloned tensors with " (clone)" suffix
* set names for tensors in unified train function for easier debugging
* allocate graph on context using ggml_new_graph
* remove handwritten training functions
* remove unused training parameters "use_scratch" and "use_unified"
* remove trailing whitespace
* remove unused train params: mem_compute1_gb & mem_compute2_gb
mem_compute_gb is used for compute when automatic memory allocator is not enabled, otherwise it can be very small to only hold the tensor definitions
mem_compute0_gb is used for automatic memory allocator (as long as measurement of max required size is not implemented)
* remove unused forward_batch function
* add debug asserts in ggml_allocr_alloc to some common pitfalls when using this function directly
* only use ggml_allocr_alloc when tensor has NULL data and is no view
* fix test when to create temporary backward graph
temporary backward graph is only necessary when using checkpointing
* fix memory "leak" in optimizers
each iteration a new cplan with new memory for work data was allocated.
now cplan creation only happens at the start of optimization, with each iteration reusing the cplan and its work data.
* reverse order of for loop in ggml_build_backward_expand to save memory when using gradient checkpointing and allocator
with this loop order gradient checkpointing with allocator on 16 layer model saves 13% memory; 2 layer memory it saves 2% memory.
the computation results are the same
* add missing lctx argument to get_example_targets_batch
* implement llama model file saving using gguf
checkpoint loading and saving disabled, to be replaced by loading and saving via gguf
* implement loading/saving of checkpointing files using GGUF
* bug fixes
* add checkpoint file version for future compatibility
* update readme with gguf filenames
* save & load opt->just_initialized value
* add first draft for checkpoint conversion script
* add gguf arch and ftype
* save opt parameter counter as uint64
* add gguf key and tensor names for optimizer and training
* add layer_norm_rms_eps to checkpoint convert script
* use same GGUF_GET_KEY macro as in llama.cpp
* use norm_rms_eps, and rope parameters and command line options to set them
* fix memory corruption bug in gguf
ctx->kv and ctx->infos was reallocated using not-aligned realloc, but freed with aligned free.
to fix this a GGML_ALIGNED_REALLOC was added, but there is no posix_memalign_realloc function.
so on non-windows and non-mingw32 platforms we fall back to aligned malloc, followed by copying
and freeing the old data.
* add gguf example cmake file
* bug fixes in tokenize_file
* bug fixes in load_llama_model_gguf
* bug fix: init model when no checkpoint was loaded
* bug fix in read_tensor_by_name
* bug fix in load_opt_context_gguf
* avoid printing lots of spaced on the unusual case that loss gets nan
* set name of tensors with empty name from what was read from gguf
* remove trailing whitespace
* print data checksums before saving and after loading to verify correctness
* bug fixes for convert-train-checkpoint-to-gguf
* temporarily add code to write old checkpoint files
used to verify that old checkpoint files are correctly converted to gguf
* bug fixes for convert-train-checkpoint-to-gguf.py loading checkpoints with opt_version=0
* remove code used to verify correctness of checkpoint file conversion
* remove trailing whitespace
* remove prediction related code
use main for prediction, it is better optimized
* update train-text-from-scratch README.md
* fix non-windows GGML_ALIGNED_REALLOC
* add missing blank line at end of file
* remove GGML_ALIGNED_REALLOC and use normal malloc/realloc/free for gguf ctx->kv & ctx->infos
* train : fix compile warnings
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* metal : fix memory leak
* metal : fix encoders memory leak
* metal : clean up more memory resources
* metal : fix more leaks
* metal : reuse dispatch queue + autoreleasepool
* metal : reuse array for command buffers and encoders
* ggml : assert for odd number of blocks on ARM
15M tinyllama is an example
* llama2.c: direct gguf output (WIP)
* Simplify vector building logic
* llama2.c gguf conversion: fix token types in converter
* llama2.c: support copying vocab from a llama gguf model file
* llama2.c: update default path for vocab model + readme
* llama2.c: use defines for gguf keys
* llama2.c: escape whitespaces w/ U+2581 in vocab converter the llama.cpp way
* llama2.c converter: cleanups + take n_ff from config
* Speedup tokenization
On current master it takes ~3.2 seconds to tokenize
Wikitext. With this change it becomes ~525 ms.
* Fixit: it was missing the piece after the last found occurence
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Make ggml-cuda.cu build with QK_K = 64
Using LLAMA_CUDA_FORCE_DMMV = ON and -nommq it runs and produces
a meaningful result.
* k_quants tuning for Falcon-7b
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* tests : write a Python tokenizer test (wip)
* llama : prefix input text for tokenization with whitespace
* llama : distinguish pieces from decoded text + fix detokenization
* common : add comments
* examples : no longer manually add leading space when tokenizing
* tests : use Python to generate tokenizer tests for C++
* tests : add option to tokenize text files
ggml-ci
* tests : add test-tokenizer-1.py
* llama.cpp : fix LF token
* hellaswag : move the concat space for clarity
* tests : add falcon tests (py + cpp, currently do not pass Unicode)
ggml-ci
* common : temporary separate llama_detokenize calls for SPM and BPE
---------
Co-authored-by: klosax <131523366+klosax@users.noreply.github.com>
* ci : add lora test
ggml-ci
* move lora summary to the top, add lora logs
ggml-ci
* ci : decrease CPU ppl runs to 2 to avoide 20 min timeout
ggml-ci
* add 7b lora test
use 1 thread for CUDA generation tests
ggml-ci
* add test with q8_0 (cpu only)
ggml-ci
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Allow convert.py to convert to q8_0
Fix issue with bounded_parallel_map and greedy consuming iterator
Display elapsed time during conversion
* Add --concurrency option
Minor improvements to help text
Clean up bounded_parallel_map function a bit
* Massive speed improvement thanks to Cebtenzzre
* Refactor types
The use of special characters within source files can break compiling on some computers with different region and language settings. Using Unicode escape sequences should allow for the code to be compiled on all setups without needing to change your computers settings or switch regions.
* Fix bug in main.cpp where penalize_nl=false has no effect. It modifies the underlying logits array, but at this point we are already working on the candidates copy.
* Suppress redefinition warning for NOMINMAX on mingw. In my installation, this macro is already defined by /usr/lib/gcc/x86_64-w64-mingw32/11/include/c++/x86_64-w64-mingw32/bits/os_defines.h:45.
* main : fix indentation
* main : pass ctx to llama_token_nl()
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Problem
-------
`nix build` fails with missing `Accelerate.h`.
Changes
-------
- Fix build of the llama.cpp with nix for Intel: add the same SDK frameworks as
for ARM
- Add `quantize` app to the output of nix flake
- Extend nix devShell with llama-python so we can use convertScript
Testing
-------
Testing the steps with nix:
1. `nix build`
Get the model and then
2. `nix develop` and then `python convert.py models/llama-2-7b.ggmlv3.q4_0.bin`
3. `nix run llama.cpp#quantize -- open_llama_7b/ggml-model-f16.gguf ./models/ggml-model-q4_0.bin 2`
4. `nix run llama.cpp#llama -- -m models/ggml-model-q4_0.bin -p "What is nix?" -n 400 --temp 0.8 -e -t 8`
Co-authored-by: Volodymyr Vitvitskyi <volodymyrvitvitskyi@SamsungPro.local>
* llama.cpp : fix spm whitespace escaping + clean up
* main.cpp : spm - add whitespace in front of prompt
* test-tokenizer-0.cpp : spm - add whitespace in front of prompt
* Add llama_beam_search().
* Add '// Beam search' heading to llama.{h,cpp} after llama_grammar_accept_token().
* Add space around * pointers and & references.
* Add spaces around comparison and assignment operators.
* Prefer west const.
* Use llama_ prefix for structs in global namespace.
* Delete obsolete comment from an earlier revision.
* Change eos to eob in llama_beam and llama_beam_view structs.
* server : add n_probs param in chat UI
* server : keep message data array & show in probabilites component
* server : add simple popover component
* server : fix completion_probabilities undefined if not set n_probs
* server : implement Probabilites
* server : handle bytes
* server : make n_probs max to 10 for easy scroll
* server : adjust for dark/light mode
* server : Fix regenerated prompt
* server : update index.html.hpp
* server : convert prob to percentage + show original value as div title
* server : fix Probabilites not used if included empty str
* server : skip byte pair in display probabilites
* server : remove array check of completion_probabilities in messages
* skip empty array or byte pair (> 1) in Probabilites
* generate index.html.hpp
* fix incorrect prob convert if the str is already a known token
* use final response to show probabilities on stop
* revert unnecessary change
* correct probabilites usage
* remove unused function
* always send partial response for get correct probs of last to_send
* fix typo
* fix content of format_final_response
* refactor probs render & make pColor transparent if not found
* send empty string when got stop_pos in partial
* avoid unnecessary empty data event & send rest of partial tokens on stop
* use <br /> for new line
* skip -1 tok in loop to avoid send '' on end
* trim last new lines on stop
* revert unnecessary change
* metal: better memory alloc w/ concurrency dispatch
The ggml-alloc should only free tensors at memory barriers.
* ggml-alloc: avoid return silently
In certain cases, the allocate_node() function may silently return
without performing any memory allocation.
* Implementing strided computation of perplexity
* Alternative way to output PPL results
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* server: allow json array in prompt or content
We accept an array of strings and numbers representing tokens,
in addition to the current string valued prompt or content.
This allows direct token input, so that any special tokens
can be processed and used at the frontend during the construction
of the json data, before sending to the server. And the server
does not need to know or parse special tokens from textual input.
With this, we can use EOS and BOS used in llama-2-chat models.
* server: use tokenizePrompt(json) and default "" if empty prompt
* server: fix prompt check
* server: tokenize endpoint no longer adds BOS
* Improve UNK, BOS, EOS token handling when converting without metadata.
* Allow importing as a module.
* Remove some obsolete code and minor cleanups.
* Set default UNK token mapping from -1 to 0 in llama.cpp
* Try to handle overflow due to buggy Windows Python with a better error message
* Improve LLaMA-2 2-, 3- and 4-bit quantization
* Q3_K_S: use Q5_K for 1st 2 layers of attention.wv and feed_forward.w2
* Q4_K_S: use Q6_K for 1st 2 layers of attention.wv and feed_forward.w2
* Q2_K and Q3_K_M: use Q5_K instead of Q4_K for 1st 2 layers of
attention.wv and feed_forward.w2
This leads to a slight model sized increase as follows:
Q2_K : 2.684G vs 2.670G
Q3_K_S: 2.775G vs 2.745G
Q3_K_M: 3.071G vs 3.057G
Q4_K_S: 3.592G vs 3.563G
LLaMA-2 PPL for context 512 changes as follows:
Q2_K : 6.6691 vs 6.8201
Q3_K_S: 6.2129 vs 6.2584
Q3_K_M: 6.0387 vs 6.1371
Q4_K_S: 5.9138 vs 6.0041
There are improvements for LLaMA-1 as well, but they are
way smaller than the above.
* Minor 4-bit quantization improvement
For the same model size as previus commit, we get
PPL = 5.9069 vs 5.9138.
* Some more fine tuning
* Adding make_qkx2_quants
With it, we get PPL = 5.8828 for L2-7B Q4_K_S.
* Another minor improvement
* Q2_K improvement
Smaller model, lower perplexity.
7B: file size = 2.632G, PPL = 6.3772 vs original 2.670G PPL = 6.8201
12B: file size = 5.056G, PPL = 5.4577 vs original 5.130G PPL = 5.7178
It is mostly Q3_K except for tok_embeddings, attention.wq, attention.wk,
which are Q2_K
* Iterating
* Revert Q5_K back to make_qkx1_quants
* Better Q6_K
* make_qkx2_quants is better for Q5_K after all
* Fix after rebasing on master
* Fix for changed tensor names
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
use a different function for no_alloc to avoid breaking backwards compat, fixes lora
remove 512 n_batch limit
fixed 2048 batch size
cleanup
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* ggml: support CUDA's half type for aarch64(#1455)
support CUDA's half type for aarch64 in ggml_fp16_t definition
* ggml: use __CUDACC__ to recognise nvcc compiler
* llama : add benchmark example
* add to examples CMakeLists.txt
* fix msvc build
* add missing include
* add Bessel's correction to stdev calculation
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* improve markdown formatting
* add missing include
* print warning is NDEBUG is not defined
* remove n_prompt and n_gen from the matrix, use each value separately instead
* better checks for non-optimized builds
* llama.cpp : fix MEM_REQ_SCRATCH0 reusing the value of n_ctx of the first call
* fix json formatting
* add sql output
* add basic cpu and gpu info (linx/cuda only)
* markdown: also show values that differ from the default
* markdown: add build id
* cleanup
* improve formatting
* formatting
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* support for templates in browser LocalStorage
* sync accepted #2409 fix from upstream
* convert autosave invocation to useEffect
* Apply suggestions from code review
Co-authored-by: Jhen-Jie Hong <iainst0409@gmail.com>
* Regen index.html.cpp, suggested from code review
---------
Co-authored-by: Jhen-Jie Hong <iainst0409@gmail.com>
* adds simple llama grammar tests
* fix lint and add Makefile
* 0 terminate code_points
* avoid dangling pointers in candidate cleanup
* cleanup grammar at end of test
The GGML memory allocator consistently places a tensor within the
optimal-fit memory block, which is the smallest block capable of
accommodating the tensor's size. During the measurement phase, the final
block is generously sized, ensuring it never qualifies as the
optimal-fit block as long as there exists another block capable of
accommodating the tensor. Nevertheless, in the evaluation phase, the
last block is constrained in size and could potentially qualify as the
optimal-fit block. Consequently, there exists the possibility of a
tensor being allocated to a different region during evaluation, leading
to more memory fragmentation in our scratch buffer.
This recent commit guarantees uniform behavior of the allocator across
both the measurement and evaluation phases, eliminating discrepancies
between the two.
* metal: matrix-matrix multiplication kernel
This commit removes MPS and uses custom matrix-matrix multiplication
kernels for all quantization types. This commit also adds grouped-query
attention to support llama2 70B.
* metal: fix performance degradation from gqa
Integers are slow on the GPU, and 64-bit divides are extremely slow.
In the context of GQA, we introduce a 64-bit divide that cannot be
optimized out by the compiler, which results in a decrease of ~8% in
inference performance. This commit fixes that issue by calculating a
part of the offset with a 32-bit divide. Naturally, this limits the
size of a single matrix to ~4GB. However, this limitation should
suffice for the near future.
* metal: fix bugs for GQA and perplexity test.
I mixed up ne02 and nb02 in previous commit.
* server : implement json-schema-to-grammar.mjs by follow python impl
* server : add grammar support in chat.mjs
* server : implement grammer param in the UI
* server : generate .hpp
* server : remove trailing whitespaces
* server : generate .hpp
* server : fix sort of prop pairs
* server : optimize regex & iteration
* ggml-alloc: Don't try to re-use buffers of external tensors
They might be weights that came from another context, so we
have no control over them (and they might be re-used elsewhere
so writing to them would be a bad idea).
* ggml-alloc: >= when checking for out-of-bounds
Co-authored-by: slaren <slarengh@gmail.com>
---------
Co-authored-by: slaren <slarengh@gmail.com>
* add log_callback to llama_context_params for custom logging.
* Fix macro expansion on gcc
* Add struct llama_state for global variables and move log_callback there
* Turn log level into enum and some minor changes.
* Remove model_for_logging parameter (not needed anymore)
* Convert remaining fprintf(stderr, ...) calls to use new macros.
* Fix enum and initialize g_state
* Fix log calls after merge
* Fix missing static
* Add back all the new lines in the logging strings
* Add comment for llama_log_callback and replace remaining printf calls
---------
Co-authored-by: grahameth <->
Co-authored-by: Helmut <helmut.buhler@inf.h-brs.de>
* Update Vim plugin
* Remove getbufoneline usage, Add input bind example.
getbufoneline() appears to be a recently added function and has been
replaced with getbufline for compatibility.
An additional example that explains how to add a keybind that works in
insert mode was added.
* added stream saving context data to file to avoid allocating unnecessary amounts of memory
* generalised copying state data to file or buffer
* added comments explaining how copy_state_data works
* fixed trailing whitespaces
* fixed save load state example
* updated save load state to use public function in llama.cpp
* - restored breakage of the llama_copy_state_data API
- moved new logic for copying llama state data to internal function
* fixed function declaration order
* restored save load state example
* fixed whitepace
* removed unused llama-util.h include
* Apply suggestions from code review
Co-authored-by: slaren <slarengh@gmail.com>
* Apply code review suggestions
Co-authored-by: slaren <slarengh@gmail.com>
---------
Co-authored-by: slaren <slarengh@gmail.com>
* examples : add JSON schema grammars
* complete JSON grammar
* ensure primitive types can be used as root of schema
* support integer type and adjust usage text
* fix hellaswag print format, cast away warning in test-double-float
* c++11 cannot use designated initializers
* add static to test-grad0.c internal functions
* use memcpy in test-double-float.c
* port c tests to c++
* use initializer list for ggml_init_params
* ggml : add graph tensor allocator
* ggml : don't calculate data pointer of unallocated tensors when creating a view with an offset
* ggml : refactor ggml_view_Nd into ggml_view_tensor_offset
* ggml : graph allocation in contexts
* allocate work buffer as a ggml_object in ggml_graph_compute_with_ctx
* llama.cpp : allocate graph in the context
* add GGML_PAD
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* add `--in-prefix-bos` to prefix BOS to user inputs; keep EOS
The BOS precedes the string specified by `--in-prefix`.
Model generated EOS is now kept in the context.
It provides a way to strictly following the prompt format used in
Llama-2-chat.
The EOS handling also benefits some existing finetunes that uses
EOS to mark the end of turn.
* examples/common: move input_prefix_bos to other bools
* metal: concurrently dispatch commands
Function `ggml_metal_graph_find_concurrency` will run and write
commands that can be issued concurrently to metal context `concur_list`
array, when `ggml_metal_graph_compute` is called for the first time.
* metal: don't call find_concurrency automatically.
* metal : code style changes
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Another speed gain for Q4_0 and Q4_1 on Metal
* Have N_DST, etc., be template parameters
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* make rms_norm_eps a parameter
* add rms_norm_eps to command line
* fix baby llama, test-grad0
* use scientific notation for eps param in the help
ggml-ci
* makefile: correct deps for server
* server: tighten settings layout a little
* server: expose all currently configured generation params in UI
* server: expose remaining generation params, for the adventurous
* server: embetter mirostat fields
* llama, main : constrain sampling to grammar
* allow loading grammar from file
* fix whitespace errors
* handle & print parser errors
* add comments to grammar syntax and allow newlines where unambiguous
* add missing include
* support alternates in root rule
* fix bugs with empty token and EOS
* adjust JSON grammar
* remove swp file
* rewrite ternary expressions
Co-authored-by: Henri Vasserman <henv@hot.ee>
* use struct for grammar elements and add Unicode support
* add unicode escapes
* add inverse char ranges
* only sample full tokens (no peeking or truncation)
* llama : minor style changes
blindly applied in online editor - hopefully I didn't break something
* update help text
* add warning message if EOS is disabled
---------
Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Fix Makefile for CLBLAST compile support and instructions for compile llama.cpp FreeBSD
* More general use-case for CLBLAST support (Linux and FreeBSD)
* ci : add 7B CUDA tests
ggml-ci
* ci : add Q2_K to the tests
* ci : bump CUDA ppl chunks
ggml-ci
* ci : increase CUDA TG len + add --ignore-eos
* ci : reduce CUDA ppl cunks down to 4 to save time
* Resync my fork with new llama.cpp commits
* examples : rename to use dash instead of underscore
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Custom RoPE + bettter memory management for CUDA
* Adjusted look ahead in ggml_cuda_pool_malloc to 5%
This is sufficient it seems.
We end up using about 200 MB less VRAM that way when running
the 13B model with context 8192.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
A fix in Makefile for FreeBSD users. In the platfrom x86_64 is amd64. This fix resolve compilation using CFLAGS and CXXFLAGS with -march=native and -mtune=native
Add two examples for interactive mode using Llama2 models (thx TheBloke for models)
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
NixOS's mkl misses some libraries like mkl-sdl.pc. See #2261
Currently NixOS doesn't have intel C compiler (icx, icpx). See https://discourse.nixos.org/t/packaging-intel-math-kernel-libraries-mkl/975
So remove it from flake.nix
Some minor changes:
- Change pkgs.python310 to pkgs.python3 to keep latest
- Add pkgconfig to devShells.default
- Remove installPhase because we have `cmake --install` from #2256
Programs in the tests directory are now build with target tests
and placed in the same location.
* clean target was expanded to remove new binaries
* test target binaries are listed in a variable
* Locations of binaries were added to the .gitignore
Signed-off-by: Jiri Podivin <jpodivin@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Miku.sh: Set default model to llama-2-7b-chat
* Miku.sh: Set ctx_size to 4096
* Miku.sh: Add in-prefix/in-suffix opts
* Miku.sh: Switch sampler to mirostat_v2 and tiny prompt improvements
* Faster Q2_K on Metal
* Deleting unnoticed and dangereous trailing white space
* Fixed bug in new metal Q2_K implementation
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* metal: use uint16_t instead of uint8_t.
Apple GPU doesn't like uint8_t. For every operation on uint8_t
the gpu need to copy the uint8_t to an empty 16 bit register, then
it can issue other instructions.
For the matrix-vector multiplication kernel only, we observed a
340~350 GB/s memory read speed on M1 Max after this commit, which is
very close to the reported hardware limit.
* metal: update rms_norm kernel
This commit double the speed of rms_norm operations by using 512 threads
per threadgroup, combining with SIMD primitives to minimize the need for
thread group barriers.
* metal: use template to reduce size
Revert modifications on block_q4_0 and block_q4_1.
* ci : run ctest
ggml-ci
* ci : add open llama 3B-v2 tests
ggml-ci
* ci : disable wget progress output
ggml-ci
* ci : add open llama 3B-v2 tg tests for q4 and q5 quantizations
ggml-ci
* tests : try to fix tail free sampling test
ggml-ci
* ci : add K-quants
ggml-ci
* ci : add short perplexity tests
ggml-ci
* ci : add README.md
* ppl : add --chunks argument to limit max number of chunks
ggml-ci
* ci : update README
* Implement customizable RoPE
The original RoPE has pre-defined parameters
theta_i = 10000^(−2(i−1)/d), for i in [1, 2, ..., d/2]
Our customizable RoPE, ggml_rope_custom_inplace, uses
theta_i = scale * base^(−2(i−1)/d), for i in [1, 2, ..., d/2]
with the default matches the original
scale = 1.0
base = 10000
The new command line arguments
--rope-freq-base
--rope-freq-scale
set the two new RoPE parameter.
Recent researches show changing these two parameters extends the context limit with minimal loss.
1. Extending Context to 8K
kaiokendev
https://kaiokendev.github.io/til#extending-context-to-8k
2. Extending Context Window of Large Language Models via Positional Interpolation
Shouyuan Chen, Sherman Wong, Liangjian Chen, Yuandong Tian
https://arxiv.org/abs/2306.15595
3. NTK-Aware Scaled RoPE allows LLaMA models to have extended (8k+) context size without any fine-tuning and minimal perplexity degradation.
https://www.reddit.com/user/bloc97https://www.reddit.com/r/LocalLLaMA/comments/14lz7j5/ntkaware_scaled_rope_allows_llama_models_to_have/
For the bold, try adding the following command line parameters to your favorite model:
-c 16384 --rope-freq-base 80000 --rope-freq-scale 0.5
* ggml-metal: fix custom rope
* common: fix argument names in help
* llama: increase MEM_REQ_EVAL for MODEL_3B
It avoids crashing for quantized weights on CPU.
Better ways to calculate the required buffer size would be better.
* llama: make MEM_REQ_EVAL depend on n_ctx
* server: use proper Content-Type in curl examples
Without the header Content-Type: application/json, curl will POST with
Content-Type: application/x-www-form-urlencoded
Though our simple server doesn't care, the httplib.h used has a limit
with CPPHTTPLIB_FORM_URL_ENCODED_PAYLOAD_MAX_LENGTH 8192
With Content-Type: application/json, we can send large json data.
* style : minor fixes, mostly indentations
* ggml : fix asserts
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* 3-5% faster Q4_0 on Metal
* 7-25% faster Q4_1 on Metal
* Oops, forgot to delete the original Q4_1 kernel
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Initial implementation
* Remove debug print
* Restore signature of llama_init_from_gpt_params
* Free guidance context
* Make freeing of guidance_ctx conditional
* Make Classifier-Free Guidance a sampling function
* Correct typo. CFG already means context-free grammar.
* Record sampling time in llama_sample_classifier_free_guidance
* Shift all values by the max value before applying logsoftmax
* Fix styling based on review
* This allows LLAMA models that were previously incompatible with K quants to function mostly as normal. This happens when a model has a vocab != 32000, e.g 32001 which means it's not divisible by 256 or 64. Since the problematic dimensions only apply for `tok_embeddings.weight` and `output.weight` (dimentions 4096 x n_vocab), we can simply quantize these layers to Q8_0 whereas the majority of the hidden layers are still K-quanted since they have compatible dimensions.
* Fix indentation
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* As an alternative, to avoid failing on Metal due to lack of Q8_0 support, instead quantize tok_embeddings.weight to Q4_0 and retain output.weight as F16. This results in a net gain of about 55mb for a 7B model compared to previous approach, but should minimize adverse impact to model quality.
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* MPI support, first cut
* fix warnings, update README
* fixes
* wrap includes
* PR comments
* Update CMakeLists.txt
* Add GH workflow, fix test
* Add info to README
* mpi : trying to move more MPI stuff into ggml-mpi (WIP) (#2099)
* mpi : add names for layer inputs + prep ggml_mpi_graph_compute()
* mpi : move all MPI logic into ggml-mpi
Not tested yet
* mpi : various fixes - communication now works but results are wrong
* mpi : fix output tensor after MPI compute (still not working)
* mpi : fix inference
* mpi : minor
* Add OpenMPI to GH action
* [mpi] continue-on-error: true
* mpi : fix after master merge
* [mpi] Link MPI C++ libraries to fix OpenMPI
* tests : fix new llama_backend API
* [mpi] use MPI_INT32_T
* mpi : factor out recv / send in functions and reuse
* mpi : extend API to allow usage with outer backends (e.g. Metal)
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
The file pathing is significant when running models inside of Termux on Android devices. llama.cpp performance is improved with loading a .bin from the $HOME directory.
* ggml_graph_compute: deprecate using ggml_context, try resolve issue #287
* rewrite: no longer consider backward compitability; plan and make_plan
* minor: rename ctx as plan; const
* remove ggml_graph_compute from tests/test-grad0.c, but current change breaks backward
* add static ggml_graph_compute_sugar()
* minor: update comments
* reusable buffers
* ggml : more consistent naming + metal fixes
* ggml : fix docs
* tests : disable grad / opt + minor naming changes
* ggml : add ggml_graph_compute_with_ctx()
- backwards compatible API
- deduplicates a lot of copy-paste
* ci : enable test-grad0
* examples : factor out plan allocation into a helper function
* llama : factor out plan stuff into a helper function
* ci : fix env
* llama : fix duplicate symbols + refactor example benchmark
* ggml : remove obsolete assert + refactor n_tasks section
* ggml : fix indentation in switch
* llama : avoid unnecessary bool
* ggml : remove comments from source file and match order in header
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
The original file name, `ggml-alpaca-7b-q4.bin`, implied the first-generation GGML. After the breaking changes (mentioned in https://github.com/ggerganov/llama.cpp/issues/382), `llama.cpp` requires GGML V3 now. Those model files are named `*ggmlv3*.bin`. We should change the example to an actually working model file, so that this thing is more likely to run out-of-the-box for more people, and less people would waste time downloading the old Alpaca model.
* use javascript generators as much cleaner API
Also add ways to access completion as promise and EventSource
* export llama_timings as struct and expose them in server
* update readme, update baked includes
* llama : uniform variable names + struct init
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update server instructions for web front end
* Update server README
* Remove duplicate OAI instructions
* Fix duplicate text
---------
Co-authored-by: Jesse Johnson <thatguy@jessejojojohnson.com>
* Generalize quantize_fns for simpler FP16 handling
* Remove call to ggml_cuda_mul_mat_get_wsize
* ci : disable FMA for mac os actions
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* expose simple web interface on root domain
* embed index and add --path for choosing static dir
* allow server to multithread
because web browsers send a lot of garbage requests we want the server
to multithread when serving 404s for favicon's etc. To avoid blowing up
llama we just take a mutex when it's invoked.
* let's try this with the xxd tool instead and see if msvc is happier with that
* enable server in Makefiles
* add /completion.js file to make it easy to use the server from js
* slightly nicer css
* rework state management into session, expose historyTemplate to settings
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* server: add option to output probabilities for completion
* server: fix issue when handling probability output for incomplete tokens for multibyte character generation
* server: fix llama_sample_top_k order
* examples/common.h: put all bool variables in gpt_params together
It's currently not possible to cross-compile llama.cpp for aarch64
because CMakeLists.txt forces -mcpu=native for that target.
-mcpu=native doesn't make sense if your build host is not the
target architecture, and clang rejects it for that reason, aborting the
build. This can be easily reproduced using the current Android NDK to build
for aarch64 on an x86_64 host.
If there is not a specific CPU-tuning target for aarch64 then -mcpu
should be omitted completely. I think that makes sense, there is not
enough variance in the aarch64 instruction set to warrant a fixed -mcpu
optimization at this point. And if someone is building natively and wishes
to enable any possible optimizations for the host device, then there is
already the LLAMA_NATIVE option available.
Fixes#495.
* convert checks in llama_load_session_file to throw and handle them
* make llama_load_session_file_internal static
* address feedbacks to avoid using exceptions
* Added broken new q4k quant
* xx + ib0
* Fix q2_k fast kernel
* Use preprocessor for QK_K
* Add q6_k fast matmul kernel
* ported q3k speedup successfully
* ported q2k and q5k speedups
* remove old dot kernels and template
* fixed global const struct types
* fixing address spaces
* fixed string too long CI issue
---------
Co-authored-by: 0cc4m <picard12@live.de>
* Remove multiple shards
* Remove multiple file loaders
* Remove llama_load_tensor_shard class
* Simplify load logic
* Remove dead code guess_n_parts function
* Remove vocab_only from constructor of llama_model_loader
* Remove alignment_prevents_mmap which is not more needed.
* Remove useless check
* add interface for float input
* fixed inpL shape and type
* add examples of input floats
* add test example for embd input
* fixed sampling
* add free for context
* fixed add end condition for generating
* add examples for llava.py
* add READMD for llava.py
* add READMD for llava.py
* add example of PandaGPT
* refactor the interface and fixed the styles
* add cmake build for embd-input
* add cmake build for embd-input
* Add MiniGPT-4 example
* change the order of the args of llama_eval_internal
* fix ci error
* Clean up compiler warnings in train-text
Some brackets to disambiguate order of operations
* Increase GGML_MAX_NAME
Avoiding strncpy danger in train-text-from-scratch and reducing potential future name length issues
* docs - Alternative way to build at Android, with CLBlast.
* doc - LD_LIBRARY_PATH complement for some Android devices when building with CLBlast inside Termux.
* doc- fix typo
* detect NUMA systems and pin work threads to nodes (linux)
* disable mmap prefetch/readahead for NUMA systems
* avoid sending finalize op to thread pool if it does nothing
* silence robot
* fix args
* make --numa a param
* recommendation that n_nodes evenly divide n_threads did not warrant such aggressive enforcement
* lower synchronization overhead
* statically allocate
* move numa state to g_state
* add description for --numa
* ggml : minor style changes
* ggml : minor style + try fix sanitizer build
* llama : allow to initialize backend with NUMA support
* llama : avoid ggml include in llama-util.h
* ggml : style / formatting
* ggml : fix handling of ops with n_threads > n_tasks > 1
* server : utilize numa parameter
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* k_quants: WIP super-blocks with 64 weights
* k_quants: WIP super-blocks with 64 weights
Q6_K scalar and AVX2 works
* k_quants: WIP super-blocks with 64 weights
Q4_K scalar and AVX2 works
* k_quants: WIP super-blocks with 64 weights
Q2_K scalar and AVX2 works. Q2_K is way too slow (it is actually slower
than the scalar implementation)
* k_quants: WIP super-blocks with 64 weights
Q3_K scalar and AVX2 works.
* k_quants: WIP super-blocks with 64 weights
Q5_K scalar and AVX2 works, and with that all
k_quants are done on AVX2 and scalar
* k_quants: WIP super-blocks with 64 weights
Q6_K working on CUDA. Cannot make it run quite as gast as
with super-blocks with 256 weigths: 8% slower on 4080,
20% slower on the 1660 (but there we fit 1 less layer on the
GPU because pf the larger model size), so some fraction of
these 20% is due to that,
* k_quants: WIP super-blocks with 64 weights
Q4_K working on CUDA. ~10% slower on GTX-1660,
16% slower on 4080.
* k_quants: WIP super-blocks with 64 weights
Q2_K working on CUDA. ~3% slower on GTX-1660,
10% slower on 4080.
* k_quants: WIP super-blocks with 64 weights
Q3_K working on CUDA.
* k_quants: WIP super-blocks with 64 weights
Q5_K working on CUDA, and with this CUDA is done.
* k_quants: WIP super-blocks with 64 weights
Q6_K working on ARM_NEON
* k_quants: WIP super-blocks with 64 weights
Q4_K working on ARM_NEON, but quite a bit slower than 256 weights
* k_quants: WIP super-blocks with 64 weights
Q2_K working on ARM_NEON, but quite a bit slower than 256 weights
* k_quants: WIP super-blocks with 64 weights
Q3_K working on ARM_NEON, but quite a bit slower than 256 weights.
* k_quants: WIP super-blocks with 64 weights
Q5_K working on ARM_NEON, but quite a bit slower than 256 weights.
With that, we have full support for ARM_NEON, although
performance is not quite there.
* k_quants: WIP super-blocks with 64 weights
Slightly more efficient Q3_K and Q5_K
* k_quants: WIP super-blocks with 64 weights
Another small improvement for Q3_K and Q5_K on ARM_NEON
* k_quants: WIP super-blocks with 64 weights
Yet another speedup for Q5_K on ARM_NEON.
We are now within 10% of the QK_K = 256 version.
* k_quants: WIP super-blocks with 64 weights
* We are able to pass preprocessor macros to the Metal
compiler
* Q6_K works and is actually slightly more efficient than
the QK_K = 256 version (25.2 ms vs 25.8 ms)
* k_quants: WIP super-blocks with 64 weights
Q4_K works on Metal and is actually slightly faster
than QK_K = 256 (21.95 ms vs 24.0 ms).
* k_quants: WIP super-blocks with 64 weights
Q2_K works on Metal and is very slightly faster
than QK_K = 256 (23.8 ms vs 24.2 ms).
* k_quants: WIP super-blocks with 64 weights
Q3_K works on Metal and is slightly faster
than QK_K = 256 (26.6 ms vs 28.3 ms).
* k_quants: WIP super-blocks with 64 weights
Q5_K works on Metal and is slightly faster
than QK_K = 256 (23.7 ms vs 26.3 ms).
* k_quants: call them _K, not _k, also on Metal
* k_quants: correctly define QK_K in llama.cpp
* Fixed bug in q4_K quantization added with the 64-block addition
* Simplify via lambda
* k_quants: swicth Q3_K to 4-bit scales when QK_K = 64
Otherwise there isn't much benefit from this
quantization type. There is some very slight loss
in accuracy, but we reduce size by ~7%.
E.g., for OpenLLaMA-3B, Q3_K_S perplexity is
8.6131 with 8-bit scales and 8.6352 with 4-bit,
while file size decreases from 1.53G to 1.44G.
* k_quants: switch Q4_K to 4-bit scales when QK_K = 64
Here the loss in accuracy is greater than for Q3_K,
but the Q4_K points still move further to the left on
the perplexity vs size curve.
* k_quants: forgot to add the Metal changes in last commit
* k_quants: change Q5_K to be type 0 when QK_K = 64
Still needs AVX2 implementation
* k_quants: AVX2 implementation for new 64-weight Q5_K
* k_quants: 10% faster ARM_NEON Q5_K dot product
* k_quants: fixed issue caused by merging with master
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Please include information about your system, the steps to reproduce the bug, and the version of llama.cpp that you are using. If possible, please provide a minimal code example that reproduces the bug.
about: Used to report issues and request enhancements for llama.cpp
title: "[User] Insert summary of your issue or enhancement.."
labels: ''
assignees: ''
---
# Prerequisites
Please answer the following questions for yourself before submitting an issue.
- [ ] I am running the latest code. Development is very rapid so there are no tagged versions as of now.
- [ ] I carefully followed the [README.md](https://github.com/ggerganov/llama.cpp/blob/master/README.md).
- [ ] I [searched using keywords relevant to my issue](https://docs.github.com/en/issues/tracking-your-work-with-issues/filtering-and-searching-issues-and-pull-requests) to make sure that I am creating a new issue that is not already open (or closed).
- [ ] I reviewed the [Discussions](https://github.com/ggerganov/llama.cpp/discussions), and have a new bug or useful enhancement to share.
# Expected Behavior
Please provide a detailed written description of what you were trying to do, and what you expected `llama.cpp` to do.
# Current Behavior
Please provide a detailed written description of what `llama.cpp` did, instead.
# Environment and Context
Please provide detailed information about your computer setup. This is important in case the issue is not reproducible except for under certain specific conditions.
* Physical (or virtual) hardware you are using, e.g. for Linux:
`$ lscpu`
* Operating System, e.g. for Linux:
`$ uname -a`
* SDK version, e.g. for Linux:
```
$ python3 --version
$ make --version
$ g++ --version
```
# Failure Information (for bugs)
Please help provide information about the failure if this is a bug. If it is not a bug, please remove the rest of this template.
# Steps to Reproduce
Please provide detailed steps for reproducing the issue. We are not sitting in front of your screen, so the more detail the better.
1. step 1
2. step 2
3. step 3
4. etc.
# Failure Logs
Please include any relevant log snippets or files. If it works under one configuration but not under another, please provide logs for both configurations and their corresponding outputs so it is easy to see where behavior changes.
Also, please try to **avoid using screenshots** if at all possible. Instead, copy/paste the console output and use [Github's markdown](https://docs.github.com/en/get-started/writing-on-github/getting-started-with-writing-and-formatting-on-github/basic-writing-and-formatting-syntax) to cleanly format your logs for easy readability.
Example environment info:
```
llama.cpp$ git log | head -1
commit 2af23d30434a677c6416812eea52ccc0af65119c
llama.cpp$ lscpu | egrep "AMD|Flags"
Vendor ID: AuthenticAMD
Model name: AMD Ryzen Threadripper 1950X 16-Core Processor
Please close your issue when it has been answered.
@duncan-donut: I'm trying to figure out what kind of "support" you need for this script and why, exactly? Is there a question about how the code works that hasn't already been addressed in one or more comments below this ticket, or are we talking something else entirely like some sorta bugfixing job because your server setup is different from mine??
I can understand if your site needs to be running smoothly and you need help with a fix of sorts but there should really be nothing wrong here that the code itself could not handle. And given that I'm getting reports about how it works perfectly well on some other servers, what exactly are we talking? A detailed report will do wonders in helping us get this resolved for ya quickly so please take your time and describe the issue(s) you see as clearly & concisely as possible!!
@duncan-donut: I'm not sure if you have access to cPanel but you could try these instructions. It is worth a shot! Let me know how it goes (or what error message, exactly!) when/if ya give that code a go? [end of text]
main: mem per token = 71159620 bytes
main: load time = 19309.95 ms
main: sample time = 168.62 ms
main: predict time = 223895.61 ms / 888.47 ms per token
main: total time = 246406.42 ms
Performance counter stats for './main -m ./models/65B/ggml-model-q4_0.bin -t 16 -n 1024 -p Please close your issue when it has been answered.':
Please answer the following questions for yourself before submitting an issue.
- [ ] I am running the latest code. Development is very rapid so there are no tagged versions as of now.
- [ ] I carefully followed the [README.md](https://github.com/ggerganov/llama.cpp/blob/master/README.md).
- [ ] I [searched using keywords relevant to my issue](https://docs.github.com/en/issues/tracking-your-work-with-issues/filtering-and-searching-issues-and-pull-requests) to make sure that I am creating a new issue that is not already open (or closed).
- [ ] I reviewed the [Discussions](https://github.com/ggerganov/llama.cpp/discussions), and have a new bug or useful enhancement to share.
# Feature Description
Please provide a detailed written description of what you were trying to do, and what you expected `llama.cpp` to do as an enhancement.
# Motivation
Please provide a detailed written description of reasons why this feature is necessary and how it is useful to `llama.cpp` users.
# Possible Implementation
If you have an idea as to how it can be implemented, please write a detailed description. Feel free to give links to external sources or share visuals that might be helpful to understand the details better.
# This workflow will upload a Python Package using Twine when a GGUF release is created
# For more information see: https://help.github.com/en/actions/language-and-framework-guides/using-python-with-github-actions#publishing-to-package-registries
# See `gguf-py/README.md` for how to make a release.
# This workflow uses actions that are not certified by GitHub.
# They are provided by a third-party and are governed by
# separate terms of service, privacy policy, and support
message(WARNING"Git repository not found; to enable automatic generation of build info, make sure Git is installed and the project is a Git repository.")
@@ -57,12 +62,11 @@ The main goal of `llama.cpp` is to run the LLaMA model using 4-bit integer quant
- Apple silicon first-class citizen - optimized via ARM NEON, Accelerate and Metal frameworks
- AVX, AVX2 and AVX512 support for x86 architectures
- Mixed F16 / F32 precision
- 4-bit, 5-bit and 8-bit integer quantization support
-Supports OpenBLAS/Apple BLAS/ARM Performance Lib/ATLAS/BLIS/Intel MKL/NVHPC/ACML/SCSL/SGIMATH and [more](https://cmake.org/cmake/help/latest/module/FindBLAS.html#blas-lapack-vendors) in BLAS
- cuBLAS and CLBlast support
- 2-bit, 3-bit, 4-bit, 5-bit, 6-bit and 8-bit integer quantization support
-CUDA, Metal, OpenCL, SYCL GPU backend support
The original implementation of `llama.cpp` was [hacked in an evening](https://github.com/ggerganov/llama.cpp/issues/33#issuecomment-1465108022).
Since then, the project has improved significantly thanks to many contributions. This project is for educational purposes and serves
Since then, the project has improved significantly thanks to many contributions. This project is mainly for educational purposes and serves
as the main playground for developing new features for the [ggml](https://github.com/ggerganov/ggml) library.
**Supported platforms:**
@@ -75,115 +79,149 @@ as the main playground for developing new features for the [ggml](https://github
**Notes:** With this packages you can build llama.cpp with OPENBLAS and
CLBLAST support for use OpenCL GPU acceleration in FreeBSD. Please read
the instructions for use and activate this options in this document below.
### Metal Build
Using Metal allows the computation to be executed on the GPU for Apple devices:
On MacOS, Metal is enabled by default. Using Metal makes the computation run on the GPU.
To disable the Metal build at compile time use the `LLAMA_NO_METAL=1` flag or the `LLAMA_METAL=OFF` cmake option.
When built with Metal support, you can explicitly disable GPU inference with the `--n-gpu-layers|-ngl 0` command-line
argument.
### MPI Build
MPI lets you distribute the computation over a cluster of machines. Because of the serial nature of LLM prediction, this won't yield any end-to-end speed-ups, but it will let you run larger models than would otherwise fit into RAM on a single machine.
First you will need MPI libraries installed on your system. The two most popular (only?) options are [MPICH](https://www.mpich.org) and [OpenMPI](https://www.open-mpi.org). Either can be installed with a package manager (`apt`, Homebrew, MacPorts, etc).
Next you will need to build the project with `LLAMA_MPI` set to true on all machines; if you're building with `make`, you will also need to specify an MPI-capable compiler (when building with CMake, this is configured automatically):
- Using `make`:
```bash
LLAMA_METAL=1 make
make CC=mpicc CXX=mpicxx LLAMA_MPI=1
```
- Using `CMake`:
```bash
mkdir build-metal
cd build-metal
cmake -DLLAMA_METAL=ON ..
cmake --build . --config Release
```
```bash
cmake -S . -B build -DLLAMA_MPI=ON
```
When built with Metal support, you can enable GPU inference with the `--gpu-layers|-ngl` command-line argument.
Any value larger than 0 will offload the computation to the GPU. For example:
Once the programs are built, download/convert the weights on all of the machines in your cluster. The paths to the weights and programs should be identical on all machines.
Next, ensure password-less SSH access to each machine from the primary host, and create a `hostfile` with a list of the hostnames and their relative "weights" (slots). If you want to use localhost for computation, use its local subnet IP address rather than the loopback address or "localhost".
Here is an example hostfile:
```
192.168.0.1:2
malvolio.local:1
```
The above will distribute the computation across 2 processes on the first host and 1 process on the second host. Each process will use roughly an equal amount of RAM. Try to keep these numbers small, as inter-process (intra-host) communication is expensive.
Finally, you're ready to run a computation using `mpirun`:
Building the program with BLAS support may lead to some performance improvements in prompt processing using batch sizes higher than 32 (the default is 512). BLAS doesn't affect the normal generation performance. There are currently three different implementations of it:
Building the program with BLAS support may lead to some performance improvements in prompt processing using batch sizes higher than 32 (the default is 512). Support with CPU-only BLAS implementations doesn't affect the normal generation performance. We may see generation performance improvements with GPU-involved BLAS implementations, e.g. cuBLAS, hipBLAS and CLBlast. There are currently several different BLAS implementations available for build and use:
- #### Accelerate Framework:
@@ -310,20 +391,37 @@ Building the program with BLAS support may lead to some performance improvements
Check [BLIS.md](docs/BLIS.md) for more information.
- #### Intel MKL
- #### Intel oneMKL
- Using manual oneAPI installation:
By default, `LLAMA_BLAS_VENDOR` is set to `Generic`, so if you already sourced intel environment script and assign `-DLLAMA_BLAS=ON` in cmake, the mkl version of Blas will automatically been selected. Otherwise please install oneAPI and follow the below steps:
```bash
mkdir build
cd build
source /opt/intel/oneapi/setvars.sh # You can skip this step if in oneapi-runtime docker image, only required for manual installation
By default, `LLAMA_BLAS_VENDOR` is set to `Generic`, so if you already sourced intel environment script and assign `-DLLAMA_BLAS=ON` in cmake, the mkl version of Blas will automatically been selected. You may also specify it by:
- Using oneAPI docker image:
If you do not want to source the environment vars and install oneAPI manually, you can also build the code using intel docker container: [oneAPI-runtime](https://hub.docker.com/r/intel/oneapi-runtime)
Building through oneAPI compilers will make avx_vnni instruction set available for intel processors that do not support avx512 and avx512_vnni.
Check [Optimizing and Running LLaMA2 on Intel® CPU](https://www.intel.com/content/www/us/en/content-details/791610/optimizing-and-running-llama2-on-intel-cpu.html) for more information.
- #### cuBLAS
This provides BLAS acceleration using the CUDA cores of your Nvidia GPU. Make sure to have the CUDA toolkit installed. You can download it from your Linux distro's package manager or from here: [CUDA Toolkit](https://developer.nvidia.com/cuda-downloads).
This provides BLAS acceleration using the CUDA cores of your Nvidia GPU. Make sure to have the CUDA toolkit installed. You can download it from your Linux distro's package manager (e.g. `apt install nvidia-cuda-toolkit`) or from here: [CUDA Toolkit](https://developer.nvidia.com/cuda-downloads).
For Jetson user, if you have Jetson Orin, you can try this: [Offical Support](https://www.jetson-ai-lab.com/tutorial_text-generation.html). If you are using an old model(nano/TX2), need some additional operations before compiling.
- Using `make`:
```bash
make LLAMA_CUBLAS=1
@@ -339,12 +437,63 @@ Building the program with BLAS support may lead to some performance improvements
The environment variable [`CUDA_VISIBLE_DEVICES`](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars) can be used to specify which GPU(s) will be used. The following compilation options are also available to tweak performance:
<!---
| LLAMA_CUDA_CUBLAS | Boolean | false | Use cuBLAS instead of custom CUDA kernels for prompt processing. Faster for all quantization formats except for q4_0 and q8_0, especially for k-quants. Increases VRAM usage (700 MiB for 7b, 970 MiB for 13b, 1430 MiB for 33b). |
| LLAMA_CUDA_FORCE_DMMV | Boolean | false | Force the use of dequantization + matrix vector multiplication kernels instead of using kernels that do matrix vector multiplication on quantized data. By default the decision is made based on compute capability (MMVQ for 6.1/Pascal/GTX 1000 or higher). Does not affect k-quants. |
| LLAMA_CUDA_DMMV_X | Positive integer >= 32 | 32 | Number of values in x direction processed by the CUDA dequantization + matrix vector multiplication kernel per iteration. Increasing this value can improve performance on fast GPUs. Power of 2 heavily recommended. Does not affect k-quants. |
| LLAMA_CUDA_MMV_Y | Positive integer | 1 | Block size in y direction for the CUDA mul mat vec kernels. Increasing this value can improve performance on fast GPUs. Power of 2 recommended. |
| LLAMA_CUDA_F16 | Boolean | false | If enabled, use half-precision floating point arithmetic for the CUDA dequantization + mul mat vec kernels and for the q4_1 and q5_1 matrix matrix multiplication kernels. Can improve performance on relatively recent GPUs. |
| LLAMA_CUDA_KQUANTS_ITER | 1 or 2 | 2 | Number of values processed per iteration and per CUDA thread for Q2_K and Q6_K quantization formats. Setting this value to 1 can improve performance for slow GPUs. |
| LLAMA_CUDA_PEER_MAX_BATCH_SIZE | Positive integer | 128 | Maximum batch size for which to enable peer access between multiple GPUs. Peer access requires either Linux or NVLink. When using NVLink enabling peer access for larger batch sizes is potentially beneficial. |
- #### hipBLAS
This provides BLAS acceleration on HIP-supported AMD GPUs.
Make sure to have ROCm installed.
You can download it from your Linux distro's package manager or from here: [ROCm Quick Start (Linux)](https://rocm.docs.amd.com/en/latest/deploy/linux/quick_start.html).
- Using `make`:
```bash
make LLAMA_HIPBLAS=1
```
- Using `CMake` for Linux (assuming a gfx1030-compatible AMD GPU):
On Linux it is also possible to use unified memory architecture (UMA) to share main memory between the CPU and integrated GPU by setting `-DLLAMA_HIP_UMA=ON"`.
However, this hurts performance for non-integrated GPUs (but enables working with integrated GPUs).
- Using `make` (example for target gfx1030, build with 16 CPU threads):
```bash
make -j16 LLAMA_HIPBLAS=1 LLAMA_HIP_UMA=1 AMDGPU_TARGETS=gxf1030
```
- Using `CMake` for Windows (using x64 Native Tools Command Prompt for VS, and assuming a gfx1100-compatible AMD GPU):
Make sure that `AMDGPU_TARGETS` is set to the GPU arch you want to compile for. The above example uses `gfx1100` that corresponds to Radeon RX 7900XTX/XT/GRE. You can find a list of targets [here](https://llvm.org/docs/AMDGPUUsage.html#processors)
Find your gpu version string by matching the most significant version information from `rocminfo | grep gfx | head -1 | awk '{print $2}'` with the list of processors, e.g. `gfx1035` maps to `gfx1030`.
The environment variable [`HIP_VISIBLE_DEVICES`](https://rocm.docs.amd.com/en/latest/understand/gpu_isolation.html#hip-visible-devices) can be used to specify which GPU(s) will be used.
If your GPU is not officially supported you can use the environment variable [`HSA_OVERRIDE_GFX_VERSION`] set to a similar GPU, for example 10.3.0 on RDNA2 (e.g. gfx1030, gfx1031, or gfx1035) or 11.0.0 on RDNA3.
The following compilation options are also available to tweak performance (yes, they refer to CUDA, not HIP, because it uses the same code as the cuBLAS version above):
| LLAMA_CUDA_DMMV_X | Positive integer >= 32 | 32 | Number of values in x direction processed by the CUDA dequantization + matrix vector multiplication kernel per iteration. Increasing this value can improve performance on fast GPUs. Power of 2 heavily recommended. Does not affect k-quants. |
| LLAMA_CUDA_DMMV_Y | Positive integer | 1 | Block size in y direction for the CUDA dequantization + mul mat vec kernels. Increasing this value can improve performance on fast GPUs. Power of 2 recommended. Does not affect k-quants. |
| LLAMA_CUDA_DMMV_F16 | Boolean | false | If enabled, use half-precision floating point arithmetic for the CUDA dequantization + mul mat vec kernels. Can improve performance on relatively recent GPUs. |
| LLAMA_CUDA_KQUANTS_ITER | 1 or 2 | 2 | Number of values processed per iteration and per CUDA thread for Q2_K and Q6_K quantization formats. Setting this value to 1 can improve performance for slow GPUs. |
| LLAMA_CUDA_DMMV_X | Positive integer >= 32 | 32 | Number of values in x direction processed by the HIP dequantization + matrix vector multiplication kernel per iteration. Increasing this value can improve performance on fast GPUs. Power of 2 heavily recommended. Does not affect k-quants. |
| LLAMA_CUDA_MMV_Y | Positive integer | 1 | Block size in y direction for the HIP mul mat vec kernels. Increasing this value can improve performance on fast GPUs. Power of 2 recommended. Does not affect k-quants. |
| LLAMA_CUDA_KQUANTS_ITER | 1 or 2 | 2 | Number of values processed per iteration and per HIP thread for Q2_K and Q6_K quantization formats. Setting this value to 1 can improve performance for slow GPUs. |
- #### CLBlast
@@ -353,6 +502,8 @@ Building the program with BLAS support may lead to some performance improvements
You will need the [OpenCL SDK](https://github.com/KhronosGroup/OpenCL-SDK).
- For Ubuntu or Debian, the packages `opencl-headers`, `ocl-icd` may be needed.
- For Windows, a pre-built SDK is available on the [OpenCL Releases](https://github.com/KhronosGroup/OpenCL-SDK/releases) page.
- <details>
<summary>Installing the OpenCL SDK from source</summary>
@@ -370,10 +521,27 @@ Building the program with BLAS support may lead to some performance improvements
```
</details>
Installing CLBlast: it may be found in your operating system's packages.
##### Installing CLBlast
Pre-built CLBlast binaries may be found on the [CLBlast Releases](https://github.com/CNugteren/CLBlast/releases) page. For Unix variants, it may also be found in your operating system's packages.
Alternatively, they may be built from source.
- <details>
<summary>If not, then installing from source:</summary>
<summary>Windows:</summary>
```cmd
set OPENCL_SDK_ROOT="C:/OpenCL-SDK-v2023.04.17-Win-x64"
When running the larger models, make sure you have enough disk space to store all the intermediate files.
### Running on Windows with prebuilt binaries
You will find prebuilt Windows binaries on the release page.
Simply download and extract the latest zip package of choice: (e.g. `llama-b1380-bin-win-avx2-x64.zip`)
From the unzipped folder, open a terminal/cmd window here and place a pre-converted `.gguf` model file. Test out the main example like so:
```
.\main -m llama-2-7b.Q4_0.gguf -n 128
```
### Memory/Disk Requirements
As the models are currently fully loaded into memory, you will need adequate disk space to save them and sufficient RAM to load them. At the moment, memory and disk requirements are the same.
@@ -457,6 +667,8 @@ As the models are currently fully loaded into memory, you will need adequate dis
Several quantization methods are supported. They differ in the resulting model disk size and inference speed.
You can use the `perplexity` example to measure perplexity over a given prompt (lower perplexity is better).
@@ -478,6 +695,18 @@ For more information, see [https://huggingface.co/docs/transformers/perplexity](
The perplexity measurements in table above are done against the `wikitext2` test dataset (https://paperswithcode.com/dataset/wikitext-2), with context length of 512.
The time per token is measured on a MacBook M1 Pro 32GB RAM using 4 and 8 threads.
Note the use of `--color` to distinguish between user input and generated text. Other parameters are explained in more detail in the [README](examples/main/README.md) for the `main` example program.
`llama.cpp` supports grammars to constrain model output. For example, you can force the model to output JSON only:
```bash
./main -m ./models/13B/ggml-model-q4_0.gguf -n 256 --grammar-file grammars/json.gbnf -p 'Request: schedule a call at 8pm; Command:'
```
The `grammars/` folder contains a handful of sample grammars. To write your own, check out the [GBNF Guide](./grammars/README.md).
For authoring more complex JSON grammars, you can also check out https://grammar.intrinsiclabs.ai/, a browser app that lets you write TypeScript interfaces which it compiles to GBNF grammars that you can save for local use. Note that the app is built and maintained by members of the community, please file any issues or FRs on [its repo](http://github.com/intrinsiclabsai/gbnfgen) and not this one.
### Instruction mode with Alpaca
1. First, download the `ggml` Alpaca model into the `./models` folder
@@ -556,6 +797,8 @@ OpenLLaMA is an openly licensed reproduction of Meta's original LLaMA model. It
### Using [GPT4All](https://github.com/nomic-ai/gpt4all)
*Note: these instructions are likely obsoleted by the GGUF update*
- Obtain the `tokenizer.model` file from LLaMA model and put it to `models`
- Obtain the `added_tokens.json` file from Alpaca model and put it to `models`
- Obtain the `gpt4all-lora-quantized.bin` file from GPT4All model and put it to `models/gpt4all-7B`
- The LLaMA models are officially distributed by Facebook and will **never** be provided through this repository.
- Refer to [Facebook's LLaMA repository](https://github.com/facebookresearch/llama/pull/73/files) if you need to request access to the model data.
### Obtaining and using the Facebook LLaMA 2 model
- Refer to [Facebook's LLaMA download page](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) if you want to access the model data.
- Alternatively, if you want to save time and space, you can download already converted and quantized models from [TheBloke](https://huggingface.co/TheBloke), including:
Please verify the [sha256 checksums](SHA256SUMS) of all downloaded model files to confirm that you have the correct model data files before creating an issue relating to your model files.
@@ -596,7 +850,7 @@ Please verify the [sha256 checksums](SHA256SUMS) of all downloaded model files t
```bash
# run the verification script
python3 .\scripts\verify-checksum-models.py
./scripts/verify-checksum-models.py
```
- On linux or macOS it is also possible to run the following commands to verify if you have all possible latest files in your self-installed `./models` subdirectory:
@@ -615,18 +869,6 @@ If your issue is with model generation quality, then please at least scan the fo
- [Aligning language models to follow instructions](https://openai.com/research/instruction-following)
- [Training language models to follow instructions with human feedback](https://arxiv.org/abs/2203.02155)
(Note: some Android devices, like the Zenfone 8, need the following command instead - "export LD_LIBRARY_PATH=/system/vendor/lib64:$LD_LIBRARY_PATH". Source: https://www.reddit.com/r/termux/comments/kc3ynp/opencl_working_in_termux_more_in_comments/ )
For easy and swift re-execution, consider documenting this final part in a .sh script file. This will enable you to rerun the process with minimal hassle.
Place your desired model into the `/llama.cpp/models/` directory and execute the `./main (...)` script.
Place your desired model into the `~/llama.cpp/models/` directory and execute the `./main (...)` script.
### Docker
@@ -697,10 +941,22 @@ Place your desired model into the `/llama.cpp/models/` directory and execute the
* Create a folder to store big models & intermediate files (ex. /llama/models)
#### Images
We have two Docker images available for this project:
We have three Docker images available for this project:
1. `ghcr.io/ggerganov/llama.cpp:full`: This image includes both the main executable file and the tools to convert LLaMA models into ggml and convert into 4-bit quantization.
2. `ghcr.io/ggerganov/llama.cpp:light`: This image only includes the main executable file.
1. `ghcr.io/ggerganov/llama.cpp:full`: This image includes both the main executable file and the tools to convert LLaMA models into ggml and convert into 4-bit quantization. (platforms: `linux/amd64`, `linux/arm64`)
2. `ghcr.io/ggerganov/llama.cpp:light`: This image only includes the main executable file. (platforms: `linux/amd64`, `linux/arm64`)
3. `ghcr.io/ggerganov/llama.cpp:server`: This image only includes the server executabhle file. (platforms: `linux/amd64`, `linux/arm64`)
Additionally, there the following images, similar to the above:
- `ghcr.io/ggerganov/llama.cpp:full-cuda`: Same as `full` but compiled with CUDA support. (platforms: `linux/amd64`)
- `ghcr.io/ggerganov/llama.cpp:light-cuda`: Same as `light` but compiled with CUDA support. (platforms: `linux/amd64`)
- `ghcr.io/ggerganov/llama.cpp:server-cuda`: Same as `server` but compiled with CUDA support. (platforms: `linux/amd64`)
- `ghcr.io/ggerganov/llama.cpp:full-rocm`: Same as `full` but compiled with ROCm support. (platforms: `linux/amd64`, `linux/arm64`)
- `ghcr.io/ggerganov/llama.cpp:light-rocm`: Same as `light` but compiled with ROCm support. (platforms: `linux/amd64`, `linux/arm64`)
- `ghcr.io/ggerganov/llama.cpp:server-rocm`: Same as `server` but compiled with ROCm support. (platforms: `linux/amd64`, `linux/arm64`)
The GPU enabled images are not currently tested by CI beyond being built. They are not built with any variation from the ones in the Dockerfiles defined in [.devops/](.devops/) and the GitHub Action defined in [.github/workflows/docker.yml](.github/workflows/docker.yml). If you need different settings (for example, a different CUDA or ROCm library, you'll need to build the images locally for now).
#### Usage
@@ -715,13 +971,54 @@ docker run -v /path/to/models:/models ghcr.io/ggerganov/llama.cpp:full --all-in-
On completion, you are ready to play!
```bash
docker run -v /path/to/models:/models ghcr.io/ggerganov/llama.cpp:full --run -m /models/7B/ggml-model-q4_0.bin -p "Building a website can be done in 10 simple steps:" -n 512
docker run -v /path/to/models:/models ghcr.io/ggerganov/llama.cpp:full --run -m /models/7B/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 512
```
or with a light image:
```bash
docker run -v /path/to/models:/models ghcr.io/ggerganov/llama.cpp:light -m /models/7B/ggml-model-q4_0.bin -p "Building a website can be done in 10 simple steps:" -n 512
docker run -v /path/to/models:/models ghcr.io/ggerganov/llama.cpp:light -m /models/7B/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 512
Assuming one has the [nvidia-container-toolkit](https://github.com/NVIDIA/nvidia-container-toolkit) properly installed on Linux, or is using a GPU enabled cloud, `cuBLAS` should be accessible inside the container.
You may want to pass in some different `ARGS`, depending on the CUDA environment supported by your container host, as well as the GPU architecture.
The defaults are:
- `CUDA_VERSION` set to `11.7.1`
- `CUDA_DOCKER_ARCH` set to `all`
The resulting images, are essentially the same as the non-CUDA images:
1. `local/llama.cpp:full-cuda`: This image includes both the main executable file and the tools to convert LLaMA models into ggml and convert into 4-bit quantization.
2. `local/llama.cpp:light-cuda`: This image only includes the main executable file.
3. `local/llama.cpp:server-cuda`: This image only includes the server executable file.
#### Usage
After building locally, Usage is similar to the non-CUDA examples, but you'll need to add the `--gpus` flag. You will also want to use the `--n-gpu-layers` flag.
```bash
docker run --gpus all -v /path/to/models:/models local/llama.cpp:full-cuda --run -m /models/7B/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 512 --n-gpu-layers 1
docker run --gpus all -v /path/to/models:/models local/llama.cpp:light-cuda -m /models/7B/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 512 --n-gpu-layers 1
docker run --gpus all -v /path/to/models:/models local/llama.cpp:server-cuda -m /models/7B/ggml-model-q4_0.gguf --port 8000 --host 0.0.0.0 -n 512 --n-gpu-layers 1
- There are no strict rules for the code style, but try to follow the patterns in the code (indentation, spaces, etc.). Vertical alignment makes things more readable and easier to batch edit
- Clean-up any trailing whitespaces, use 4 spaces for indentation, brackets on the same line, `void * ptr`, `int & a`
- See [good first issues](https://github.com/ggerganov/llama.cpp/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22) for tasks suitable for first contributions
- Tensors store data in row-major order. We refer to dimension 0 as columns, 1 as rows, 2 as matrices
- Matrix multiplication is unconventional: [`z = ggml_mul_mat(ctx, x, y)`](https://github.com/ggerganov/llama.cpp/blob/880e352277fc017df4d5794f0c21c44e1eae2b84/ggml.h#L1058-L1064) means `zT = x @ yT`
SYCL is a higher-level programming model to improve programming productivity on various hardware accelerators—such as CPUs, GPUs, and FPGAs. It is a single-source embedded domain-specific language based on pure C++17.
oneAPI is a specification that is open and standards-based, supporting multiple architecture types including but not limited to GPU, CPU, and FPGA. The spec has both direct programming and API-based programming paradigms.
Intel uses the SYCL as direct programming language to support CPU, GPUs and FPGAs.
To avoid to re-invent the wheel, this code refer other code paths in llama.cpp (like OpenBLAS, cuBLAS, CLBlast). We use a open-source tool [SYCLomatic](https://github.com/oneapi-src/SYCLomatic) (Commercial release [Intel® DPC++ Compatibility Tool](https://www.intel.com/content/www/us/en/developer/tools/oneapi/dpc-compatibility-tool.html)) migrate to SYCL.
The llama.cpp for SYCL is used to support Intel GPUs.
For Intel CPU, recommend to use llama.cpp for X86 (Intel MKL building).
## OS
|OS|Status|Verified|
|-|-|-|
|Linux|Support|Ubuntu 22.04|
|Windows|Ongoing| |
## Intel GPU
|Intel GPU| Status | Verified Model|
|-|-|-|
|Intel Data Center Max Series| Support| Max 1550|
|Intel Data Center Flex Series| Support| Flex 170|
a. Please follow the procedure in [Get the Intel® oneAPI Base Toolkit ](https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit.html).
Recommend to install to default folder: **/opt/intel/oneapi**.
Following guide use the default folder as example. If you use other folder, please modify the following guide info with your folder.
b. Check
```
source /opt/intel/oneapi/setvars.sh
sycl-ls
```
There should be one or more level-zero devices. Like **[ext_oneapi_level_zero:gpu:0]**.
max compute_units 512, max work group size 1024, max sub group size 32, global mem size 16225243136
```
|Attribute|Note|
|-|-|
|compute capability 1.3|Level-zero running time, recommended |
|compute capability 3.0|OpenCL running time, slower than level-zero in most cases|
4. Set device ID and execute llama.cpp
Set device ID = 0 by **GGML_SYCL_DEVICE=0**
```
GGML_SYCL_DEVICE=0 ./build/bin/main -m models/llama-2-7b.Q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 400 -e -ngl 33
```
or run by script:
```
./examples/sycl/run_llama2.sh
```
Note:
- By default, mmap is used to read model file. In some cases, it leads to the hang issue. Recommend to use parameter **--no-mmap** to disable mmap() to skip this issue.
5. Check the device ID in output
Like:
```
Using device **0** (Intel(R) Arc(TM) A770 Graphics) as main device
```
## Environment Variable
#### Build
|Name|Value|Function|
|-|-|-|
|LLAMA_SYCL|ON (mandatory)|Enable build with SYCL code path. <br>For FP32/FP16, LLAMA_SYCL=ON is mandatory.|
|LLAMA_SYCL_F16|ON (optional)|Enable FP16 build with SYCL code path. Faster for long-prompt inference. <br>For FP32, not set it.|
|CMAKE_C_COMPILER|icx|Use icx compiler for SYCL code path|
|CMAKE_CXX_COMPILER|icpx|use icpx for SYCL code path|
#### Running
|Name|Value|Function|
|-|-|-|
|GGML_SYCL_DEVICE|0 (default) or 1|Set the device id used. Check the device ids by default running output|
|GGML_SYCL_DEBUG|0 (default) or 1|Enable log function by macro: GGML_SYCL_DEBUG|
## Known Issue
- Error: `error while loading shared libraries: libsycl.so.7: cannot open shared object file: No such file or directory`.
Miss to enable oneAPI running environment.
Install oneAPI base toolkit and enable it by: `source /opt/intel/oneapi/setvars.sh`.
- Hang during startup
llama.cpp use mmap as default way to read model file and copy to GPU. In some system, memcpy will be abnormal and block.
./build/bin/main -m models/llama_7b_q4_0.gguf -n 128 --prompt "Once upon a time"
```
## Benchmark
The perplexity measurements in table above are done against the `wikitext2` test dataset (https://paperswithcode.com/dataset/wikitext-2), with context length of 512.
(time cmake -DCMAKE_BUILD_TYPE=Release ${CMAKE_EXTRA} -DLLAMA_CUBLAS=1 .. ) 2>&1| tee -a $OUT/${ci}-cmake.log
(time make -j ) 2>&1| tee -a $OUT/${ci}-make.log
python3 ../convert.py ${path_models}
model_f16="${path_models}/ggml-model-f16.gguf"
model_q8_0="${path_models}/ggml-model-q8_0.gguf"
model_q4_0="${path_models}/ggml-model-q4_0.gguf"
model_q4_1="${path_models}/ggml-model-q4_1.gguf"
model_q5_0="${path_models}/ggml-model-q5_0.gguf"
model_q5_1="${path_models}/ggml-model-q5_1.gguf"
model_q2_k="${path_models}/ggml-model-q2_k.gguf"
model_q3_k="${path_models}/ggml-model-q3_k.gguf"
model_q4_k="${path_models}/ggml-model-q4_k.gguf"
model_q5_k="${path_models}/ggml-model-q5_k.gguf"
model_q6_k="${path_models}/ggml-model-q6_k.gguf"
wiki_test="${path_wiki}/wiki.test.raw"
./bin/quantize ${model_f16}${model_q8_0} q8_0
./bin/quantize ${model_f16}${model_q4_0} q4_0
./bin/quantize ${model_f16}${model_q4_1} q4_1
./bin/quantize ${model_f16}${model_q5_0} q5_0
./bin/quantize ${model_f16}${model_q5_1} q5_1
./bin/quantize ${model_f16}${model_q2_k} q2_k
./bin/quantize ${model_f16}${model_q3_k} q3_k
./bin/quantize ${model_f16}${model_q4_k} q4_k
./bin/quantize ${model_f16}${model_q5_k} q5_k
./bin/quantize ${model_f16}${model_q6_k} q6_k
(time ./bin/main --model ${model_f16} -t 1 -ngl 999 -s 1234 -n 256 --ignore-eos -p "I believe the meaning of life is") 2>&1| tee -a $OUT/${ci}-tg-f16.log
(time ./bin/main --model ${model_q8_0} -t 1 -ngl 999 -s 1234 -n 256 --ignore-eos -p "I believe the meaning of life is") 2>&1| tee -a $OUT/${ci}-tg-q8_0.log
(time ./bin/main --model ${model_q4_0} -t 1 -ngl 999 -s 1234 -n 256 --ignore-eos -p "I believe the meaning of life is") 2>&1| tee -a $OUT/${ci}-tg-q4_0.log
(time ./bin/main --model ${model_q4_1} -t 1 -ngl 999 -s 1234 -n 256 --ignore-eos -p "I believe the meaning of life is") 2>&1| tee -a $OUT/${ci}-tg-q4_1.log
(time ./bin/main --model ${model_q5_0} -t 1 -ngl 999 -s 1234 -n 256 --ignore-eos -p "I believe the meaning of life is") 2>&1| tee -a $OUT/${ci}-tg-q5_0.log
(time ./bin/main --model ${model_q5_1} -t 1 -ngl 999 -s 1234 -n 256 --ignore-eos -p "I believe the meaning of life is") 2>&1| tee -a $OUT/${ci}-tg-q5_1.log
(time ./bin/main --model ${model_q2_k} -t 1 -ngl 999 -s 1234 -n 256 --ignore-eos -p "I believe the meaning of life is") 2>&1| tee -a $OUT/${ci}-tg-q2_k.log
(time ./bin/main --model ${model_q3_k} -t 1 -ngl 999 -s 1234 -n 256 --ignore-eos -p "I believe the meaning of life is") 2>&1| tee -a $OUT/${ci}-tg-q3_k.log
(time ./bin/main --model ${model_q4_k} -t 1 -ngl 999 -s 1234 -n 256 --ignore-eos -p "I believe the meaning of life is") 2>&1| tee -a $OUT/${ci}-tg-q4_k.log
(time ./bin/main --model ${model_q5_k} -t 1 -ngl 999 -s 1234 -n 256 --ignore-eos -p "I believe the meaning of life is") 2>&1| tee -a $OUT/${ci}-tg-q5_k.log
(time ./bin/main --model ${model_q6_k} -t 1 -ngl 999 -s 1234 -n 256 --ignore-eos -p "I believe the meaning of life is") 2>&1| tee -a $OUT/${ci}-tg-q6_k.log
(time ./bin/perplexity --model ${model_f16} -f ${wiki_test} -t 1 -ngl 999 -c 2048 -b 512 --chunks 4) 2>&1| tee -a $OUT/${ci}-tg-f16.log
(time ./bin/perplexity --model ${model_q8_0} -f ${wiki_test} -t 1 -ngl 999 -c 2048 -b 512 --chunks 4) 2>&1| tee -a $OUT/${ci}-tg-q8_0.log
(time ./bin/perplexity --model ${model_q4_0} -f ${wiki_test} -t 1 -ngl 999 -c 2048 -b 512 --chunks 4) 2>&1| tee -a $OUT/${ci}-tg-q4_0.log
(time ./bin/perplexity --model ${model_q4_1} -f ${wiki_test} -t 1 -ngl 999 -c 2048 -b 512 --chunks 4) 2>&1| tee -a $OUT/${ci}-tg-q4_1.log
(time ./bin/perplexity --model ${model_q5_0} -f ${wiki_test} -t 1 -ngl 999 -c 2048 -b 512 --chunks 4) 2>&1| tee -a $OUT/${ci}-tg-q5_0.log
(time ./bin/perplexity --model ${model_q5_1} -f ${wiki_test} -t 1 -ngl 999 -c 2048 -b 512 --chunks 4) 2>&1| tee -a $OUT/${ci}-tg-q5_1.log
(time ./bin/perplexity --model ${model_q2_k} -f ${wiki_test} -t 1 -ngl 999 -c 2048 -b 512 --chunks 4) 2>&1| tee -a $OUT/${ci}-tg-q2_k.log
(time ./bin/perplexity --model ${model_q3_k} -f ${wiki_test} -t 1 -ngl 999 -c 2048 -b 512 --chunks 4) 2>&1| tee -a $OUT/${ci}-tg-q3_k.log
(time ./bin/perplexity --model ${model_q4_k} -f ${wiki_test} -t 1 -ngl 999 -c 2048 -b 512 --chunks 4) 2>&1| tee -a $OUT/${ci}-tg-q4_k.log
(time ./bin/perplexity --model ${model_q5_k} -f ${wiki_test} -t 1 -ngl 999 -c 2048 -b 512 --chunks 4) 2>&1| tee -a $OUT/${ci}-tg-q5_k.log
(time ./bin/perplexity --model ${model_q6_k} -f ${wiki_test} -t 1 -ngl 999 -c 2048 -b 512 --chunks 4) 2>&1| tee -a $OUT/${ci}-tg-q6_k.log
(time ./bin/imatrix --model ${model_f16} -f ${wiki_test} -t 1 -ngl 999 -c 2048 -b 512 --chunks 4) 2>&1| tee -a $OUT/${ci}-imatrix.log
(time ./bin/save-load-state --model ${model_q4_0}) 2>&1| tee -a $OUT/${ci}-save-load-state.log
message(WARNING"Git repository not found; to enable automatic generation of build info, make sure Git is installed and the project is a Git repository.")
set(GIT_INDEX"")
endif()
# Add a custom command to rebuild build-info.cpp when .git/index changes
## Verifying that the model is running on the GPU with cuBLAS
Make sure you compiled llama with the correct env variables according to [this guide](../README.md#cublas), so that llama accepts the `-ngl N` (or `--n-gpu-layers N`) flag. When running llama, you may configure `N` to be very large, and llama will offload the maximum possible number of layers to the GPU, even if it's less than the number you configured. For example:
```shell
./main -m "path/to/model.bin" -ngl 200000 -p "Please sir, may I have some "
./main -m "path/to/model.gguf" -ngl 200000 -p "Please sir, may I have some "
```
When running llama, before it starts the inference work, it will output diagnostic information that shows whether cuBLAS is offloading work to the GPU. Look for these lines:
If you see these lines, then the GPU is being used.
## Verifying that the CPU is not oversaturated
llama accepts a `-t N` (or `--threads N`) parameter. It's extremely important that this parameter is not too large. If your token generation is extremely slow, try setting this number to 1. If this significantly improves your token generation speed, then your CPU is being oversaturated and you need to explicitly set this parameter to the number of the physicial CPU cores on your machine (even if you utilize a GPU). If in doubt, start with 1 and double the amount until you hit a performance bottleneck, then scale the number down.
llama accepts a `-t N` (or `--threads N`) parameter. It's extremely important that this parameter is not too large. If your token generation is extremely slow, try setting this number to 1. If this significantly improves your token generation speed, then your CPU is being oversaturated and you need to explicitly set this parameter to the number of the physical CPU cores on your machine (even if you utilize a GPU). If in doubt, start with 1 and double the amount until you hit a performance bottleneck, then scale the number down.
# Example of runtime flags effect on inference speed benchmark
Run command: `./main -m "path/to/model.bin" -p "-p "An extremely detailed description of the 10 best ethnic dishes will follow, with recipes: " -n 1000 [additional benchmark flags]`
Run command: `./main -m "path/to/model.gguf" -p "An extremely detailed description of the 10 best ethnic dishes will follow, with recipes: " -n 1000 [additional benchmark flags]`
# Uncomment and adjust to the number of CPU cores you want to use.
#N_THREAD="${N_THREAD:-4}"
CTX_SIZE="${CTX_SIZE:-4096}"
N_PREDICTS="${N_PREDICTS:-4096}"
GEN_OPTIONS=(--batch_size 1024
--ctx_size 2048
--ctx_size "$CTX_SIZE"
--keep -1
--repeat_last_n 256
--repeat_penalty 1.17647
--temp 0.7
--top_k 40
--top_p 0.5)
--temp 0.6
--mirostat 2)
if[ -n "$N_THREAD"];then
GEN_OPTIONS+=(--threads "$N_THREAD")
@@ -24,16 +24,17 @@ fi
./main "${GEN_OPTIONS[@]}"\
--model "$MODEL"\
--in-prefix " "\
--in-suffix "${AI_NAME}:"\
--n_predict "$N_PREDICTS"\
--color --interactive \
--reverse-prompt "${USER_NAME}:"\
--prompt "
This is a transcript of a 1000 page, never ending conversation between ${USER_NAME} and the cute and helpful AI assistant ${AI_NAME}. ${AI_NAME} is a girl who is an AI running on the user's computer.
--prompt "This is a transcript of a 1000 page, never ending conversation between ${USER_NAME} and the cute and helpful AI assistant ${AI_NAME}. ${AI_NAME} is a girl who is an AI running on the user's computer.
${AI_NAME} can think for herself without the user seeing her thoughts by adding a /think prefix to her output. She uses this to reason about the world and to think about what she should say next.
${AI_NAME} is always coherent and makes sense, but if she isn't sure if what she is saying is correct, she will ask the user for help.
${AI_NAME} is a very helpful AI and will help the user with anything they need. She is also very friendly and will try to make the user feel better if they are sad.
${AI_NAME} is also very curious and will ask the user a lot of questions about themselves and their life. She will also try to make the user like her.
The conversation is only between ${USER_NAME} and ${AI_NAME}
The conversation is only between ${USER_NAME} and ${AI_NAME}.
The conversation is only through text, so ${AI_NAME} can't see ${USER_NAME}'s face or hear his voice.
${AI_NAME} can only communicate through text, so she can't send images or videos.
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.