Compare commits

..

27 Commits

Author SHA1 Message Date
Georgi Gerganov
f55b647300 llama : minor indentation during tensor loading
ggml-ci
2024-07-04 19:34:04 +03:00
Francis Couture-Harpin
18e92879d5 llama : fix t5 uses of n_head and n_ff 2024-07-04 11:52:48 -04:00
Francis Couture-Harpin
c6ac198424 Merge branch 'master' into openelm 2024-07-04 11:45:21 -04:00
Francis Couture-Harpin
269e07bb00 llama : use const ref for print_f and fix division by zero 2024-07-04 11:39:32 -04:00
ditsuke
51d2ebadbb build: Export hf-to-gguf as snakecase 2024-07-04 15:39:13 +00:00
ditsuke
1e920018d3 doc: Add context for why we add an explicit pytorch source 2024-07-04 15:39:13 +00:00
ditsuke
01a5f06550 chore: Remove rebase artifacts 2024-07-04 15:39:13 +00:00
ditsuke
07786a61a2 chore: Fixup requirements and build 2024-07-04 15:39:13 +00:00
ditsuke
de14e2ea2b chore: ignore all __pychache__ 2024-07-04 15:39:13 +00:00
ditsuke
821922916f fix: Update script paths in CI scripts 2024-07-04 15:39:13 +00:00
ditsuke
b1c3f26e5e fix: Actually include scripts in build
Not namespaced though :(
2024-07-04 15:39:13 +00:00
ditsuke
b0a46993df build(python): Package scripts with pip-0517 compliance 2024-07-04 15:39:13 +00:00
Georgi Gerganov
199d0fb0c9 Merge branch 'master' into pr/7359 2024-07-04 18:25:16 +03:00
Georgi Gerganov
3fe395d220 llama : handle n_head == 0 2024-07-04 18:23:17 +03:00
fairydreaming
807b0c49ff Inference support for T5 and FLAN-T5 model families (#5763)
* llama : add inference support and model types for T5 and FLAN-T5 model families

* llama : add new API functions to support encoder-decoder models: llama_encode(), llama_model_has_encoder(), llama_model_decoder_start_token()

* common, llama-cli, llama-batched : add support for encoder-decoder models

* convert-hf : handle shared token embeddings tensors in T5Model

* convert-hf : add support for SentencePiece BPE tokenizer in T5Model (for Pile-T5 models)

* convert-hf : add MT5ForConditionalGeneration and UMT5ForConditionalGeneration to architectures supported by T5Model

* convert : add t5 tokenizer tests, use "slow" HF tokenizer for t5

---------

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-07-04 15:46:11 +02:00
Georgi Gerganov
22a648f8cc Merge branch 'master' into pr/7359 2024-07-04 16:42:13 +03:00
Georgi Gerganov
9971c38ada llama : do not print hparams for vocab-only models 2024-07-04 16:39:02 +03:00
Georgi Gerganov
b59ddf945e llama : fix save/load state 2024-07-04 15:55:23 +03:00
Georgi Gerganov
29ab5a0ed1 llama : use std::array for per-layer hparams 2024-07-04 15:35:15 +03:00
Daniel Bevenius
f8c4c0738d tests : add _CRT_SECURE_NO_WARNINGS for WIN32 (#8231)
This commit adds the compile definition `_CRT_SECURE_NO_WARNINGS`
to the root cmake subproject.

The motivation for this is that currently the following warnings are
displayed when compiling the tests and common cmake subprojects:
```console
test-llama-grammar.cpp
C:\llama.cpp\src\.\llama.cpp(1406,77): warning C4996: 'strerror':
This function or variable may be unsafe. Consider using strerror_s
instead. To disable deprecation, use _CRT_SECURE_NO_WARNINGS. See
online help for details.
[C:\llama.cpp\build\tests\test-llama-grammar.vcxproj]
...
```

This compile definition is currently set for the `src` subproject
and this change moves into the root cmake project so that it is applied
to all cmake subprojects.
2024-07-04 13:53:42 +03:00
Daniel Bevenius
402d6feffa llama : suppress unref var in Windows MSVC (#8150)
* llama : suppress unref var in Windows MSVC

This commit suppresses two warnings that are currently generated for
src/llama.cpp when building on Windows MSVC

```console
C:\llama.cpp\src\llama.cpp(14349,45): warning C4101: 'ex':
unreferenced local variable [C:\llama.cpp\build\src\llama.vcxproj]
C:\llama.cpp\src\llama.cpp(19285,44): warning C4101: 'e':
unreferenced local variable [C:\llama.cpp\build\src\llama.vcxproj]
```

* Update src/llama.cpp

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-07-04 13:50:57 +03:00
compilade
e3e33c0cbc llama : minor spacing changes
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-07-01 15:23:02 -04:00
Francis Couture-Harpin
c8cdb48d10 llama : support all OpenELM models
* llama : add variable GQA and variable FFN sizes

Some metadata keys can now also be arrays to support setting
their value per-layer for models like OpenELM.
2024-06-30 23:14:01 -04:00
Francis Couture-Harpin
51b2577dd4 Merge branch 'master' into openelm 2024-06-30 16:22:07 -04:00
Icecream95
60b2e1b9c5 fixup! Initial OpenELM support (270M only so far)
Fix formatting
2024-05-18 20:19:10 +12:00
Icecream95
aaabe2e361 Fill out missing entries in llama_model_type_name 2024-05-18 19:56:14 +12:00
Icecream95
217d8d7b77 Initial OpenELM support (270M only so far) 2024-05-18 19:41:42 +12:00
49 changed files with 3006 additions and 336 deletions

11
.gitignore vendored
View File

@@ -98,13 +98,14 @@ examples/server/*.mjs.hpp
# Python
__pycache__
.venv
/Pipfile
dist
poetry.lock
/.venv
__pycache__/
*/poetry.lock
poetry.toml
# Nix
/result
# Test binaries
/tests/test-backend-ops
/tests/test-double-float

View File

@@ -42,6 +42,10 @@ endif()
option(BUILD_SHARED_LIBS "build shared libraries" ${BUILD_SHARED_LIBS_DEFAULT})
if (WIN32)
add_compile_definitions(_CRT_SECURE_NO_WARNINGS)
endif()
#
# option list
#
@@ -152,7 +156,7 @@ install(FILES ${CMAKE_CURRENT_BINARY_DIR}/llama-config.cmake
DESTINATION ${CMAKE_INSTALL_LIBDIR}/cmake/llama)
install(
FILES convert-hf-to-gguf.py
FILES convert_hf_to_gguf.py
PERMISSIONS
OWNER_READ
OWNER_WRITE

View File

@@ -688,7 +688,7 @@ function gg_run_embd_bge_small {
(time cmake -DCMAKE_BUILD_TYPE=Release ${CMAKE_EXTRA} .. ) 2>&1 | tee -a $OUT/${ci}-cmake.log
(time make -j ) 2>&1 | tee -a $OUT/${ci}-make.log
python3 ../convert-hf-to-gguf.py ${path_models} --outfile ${path_models}/ggml-model-f16.gguf
python3 ../convert_hf_to_gguf.py ${path_models} --outfile ${path_models}/ggml-model-f16.gguf
model_f16="${path_models}/ggml-model-f16.gguf"
model_q8_0="${path_models}/ggml-model-q8_0.gguf"

View File

@@ -2070,7 +2070,24 @@ std::tuple<struct llama_model *, struct llama_context *> llama_init_from_gpt_par
if (params.warmup) {
LOG("warming up the model with an empty run\n");
std::vector<llama_token> tmp = { llama_token_bos(model), llama_token_eos(model), };
std::vector<llama_token> tmp;
llama_token bos = llama_token_bos(model);
llama_token eos = llama_token_eos(model);
// some models (e.g. T5) don't have a BOS token
if (bos != -1) {
tmp.push_back(bos);
}
tmp.push_back(eos);
if (llama_model_has_encoder(model)) {
llama_encode(lctx, llama_batch_get_one(tmp.data(), tmp.size(), 0, 0));
llama_token decoder_start_token_id = llama_model_decoder_start_token(model);
if (decoder_start_token_id == -1) {
decoder_start_token_id = bos;
}
tmp.clear();
tmp.push_back(decoder_start_token_id);
}
llama_decode(lctx, llama_batch_get_one(tmp.data(), std::min(tmp.size(), (size_t) params.n_batch), 0, 0));
llama_kv_cache_clear(lctx);
llama_synchronize(lctx);

View File

@@ -13,7 +13,7 @@ import sys
from enum import IntEnum
from pathlib import Path
from hashlib import sha256
from typing import TYPE_CHECKING, Any, Callable, ContextManager, Iterable, Iterator, Sequence, TypeVar, cast
from typing import TYPE_CHECKING, Any, Callable, ContextManager, Iterable, Iterator, Literal, Sequence, TypeVar, cast
import math
import numpy as np
@@ -677,6 +677,51 @@ class Model:
special_vocab = gguf.SpecialVocab(self.dir_model, n_vocab=len(tokens))
special_vocab.add_to_gguf(self.gguf_writer)
def _set_vocab_builtin(self, model_name: Literal["gpt-neox", "llama-spm"], vocab_size: int):
tokenizer_path = Path(sys.path[0]) / "models" / f"ggml-vocab-{model_name}.gguf"
logger.warning(f"Using tokenizer from '{os.path.relpath(tokenizer_path, os.getcwd())}'")
vocab_reader = gguf.GGUFReader(tokenizer_path, "r")
default_pre = "mpt" if model_name == "gpt-neox" else "default"
field = vocab_reader.get_field(gguf.Keys.Tokenizer.MODEL)
assert field # tokenizer model
self.gguf_writer.add_tokenizer_model(bytes(field.parts[-1]).decode("utf-8"))
field = vocab_reader.get_field(gguf.Keys.Tokenizer.PRE)
self.gguf_writer.add_tokenizer_pre(bytes(field.parts[-1]).decode("utf-8") if field else default_pre)
field = vocab_reader.get_field(gguf.Keys.Tokenizer.LIST)
assert field # token list
self.gguf_writer.add_token_list([bytes(field.parts[i]) for i in field.data][:vocab_size])
if model_name == "llama-spm":
field = vocab_reader.get_field(gguf.Keys.Tokenizer.SCORES)
assert field # token scores
self.gguf_writer.add_token_scores([field.parts[i].tolist()[0] for i in field.data][:vocab_size])
field = vocab_reader.get_field(gguf.Keys.Tokenizer.TOKEN_TYPE)
assert field # token types
self.gguf_writer.add_token_types([field.parts[i].tolist()[0] for i in field.data][:vocab_size])
if model_name != "llama-spm":
field = vocab_reader.get_field(gguf.Keys.Tokenizer.MERGES)
assert field # token merges
self.gguf_writer.add_token_merges([bytes(field.parts[i]) for i in field.data])
if (field := vocab_reader.get_field(gguf.Keys.Tokenizer.BOS_ID)) is not None:
self.gguf_writer.add_bos_token_id(field.parts[-1].tolist()[0])
if (field := vocab_reader.get_field(gguf.Keys.Tokenizer.EOS_ID)) is not None:
self.gguf_writer.add_eos_token_id(field.parts[-1].tolist()[0])
if (field := vocab_reader.get_field(gguf.Keys.Tokenizer.UNK_ID)) is not None:
self.gguf_writer.add_unk_token_id(field.parts[-1].tolist()[0])
if (field := vocab_reader.get_field(gguf.Keys.Tokenizer.PAD_ID)) is not None:
self.gguf_writer.add_pad_token_id(field.parts[-1].tolist()[0])
if (field := vocab_reader.get_field(gguf.Keys.Tokenizer.ADD_BOS)) is not None:
self.gguf_writer.add_add_bos_token(field.parts[-1].tolist()[0])
if (field := vocab_reader.get_field(gguf.Keys.Tokenizer.ADD_EOS)) is not None:
self.gguf_writer.add_add_eos_token(field.parts[-1].tolist()[0])
@Model.register("GPTNeoXForCausalLM")
class GPTNeoXModel(Model):
@@ -2439,39 +2484,7 @@ class MambaModel(Model):
self._set_vocab_sentencepiece()
else:
# Use the GPT-NeoX tokenizer when no tokenizer files are present
tokenizer_path = Path(sys.path[0]) / "models" / "ggml-vocab-gpt-neox.gguf"
logger.warning(f"Using tokenizer from '{os.path.relpath(tokenizer_path, os.getcwd())}'")
neox_reader = gguf.GGUFReader(tokenizer_path, "r")
field = neox_reader.get_field(gguf.Keys.Tokenizer.MODEL)
self.gguf_writer.add_tokenizer_model(bytes(field.parts[-1]).decode("utf-8") if field else "gpt2")
field = neox_reader.get_field(gguf.Keys.Tokenizer.PRE)
self.gguf_writer.add_tokenizer_pre(bytes(field.parts[-1]).decode("utf-8") if field else "mpt")
field = neox_reader.get_field(gguf.Keys.Tokenizer.LIST)
assert field
self.gguf_writer.add_token_list([bytes(field.parts[i]) for i in field.data][:vocab_size])
field = neox_reader.get_field(gguf.Keys.Tokenizer.TOKEN_TYPE)
assert field
self.gguf_writer.add_token_types([field.parts[i].tolist()[0] for i in field.data][:vocab_size])
field = neox_reader.get_field(gguf.Keys.Tokenizer.MERGES)
assert field
self.gguf_writer.add_token_merges([bytes(field.parts[i]) for i in field.data])
field = neox_reader.get_field(gguf.Keys.Tokenizer.BOS_ID)
self.gguf_writer.add_bos_token_id(field.parts[-1].tolist()[0] if field else 1)
field = neox_reader.get_field(gguf.Keys.Tokenizer.EOS_ID)
self.gguf_writer.add_eos_token_id(field.parts[-1].tolist()[0] if field else 0)
field = neox_reader.get_field(gguf.Keys.Tokenizer.UNK_ID)
self.gguf_writer.add_unk_token_id(field.parts[-1].tolist()[0] if field else 0)
field = neox_reader.get_field(gguf.Keys.Tokenizer.PAD_ID)
self.gguf_writer.add_pad_token_id(field.parts[-1].tolist()[0] if field else 0)
self._set_vocab_builtin("gpt-neox", vocab_size)
def set_gguf_parameters(self):
d_model = self.find_hparam(["hidden_size", "d_model"])
@@ -2623,6 +2636,82 @@ class JinaBertV2Model(BertModel):
self.gguf_writer.add_add_eos_token(True)
@Model.register("OpenELMForCausalLM")
class OpenELMModel(Model):
model_arch = gguf.MODEL_ARCH.OPENELM
@staticmethod
def _make_divisible(v: float | int, divisor: int) -> int:
# ref: https://huggingface.co/apple/OpenELM-270M-Instruct/blob/eb111ff2e6724348e5b905984063d4064d4bc579/configuration_openelm.py#L34-L38
new_v = max(divisor, int(v + divisor / 2) // divisor * divisor)
# Make sure that round down does not go down by more than 10%.
if new_v < 0.9 * v:
new_v += divisor
return new_v
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
ffn_multipliers: list[float] = self.hparams["ffn_multipliers"]
ffn_dim_divisor: int = self.hparams["ffn_dim_divisor"]
self._n_embd: int = self.hparams["model_dim"]
self._num_kv_heads: list[int] = self.hparams["num_kv_heads"]
self._num_query_heads: list[int] = self.hparams["num_query_heads"]
self._ffn_dims: list[int] = [
OpenELMModel._make_divisible(multiplier * self._n_embd, ffn_dim_divisor)
for multiplier in ffn_multipliers
]
assert isinstance(self._num_kv_heads, list) and isinstance(self._num_kv_heads[0], int)
assert isinstance(self._num_query_heads, list) and isinstance(self._num_query_heads[0], int)
# Uses the tokenizer from meta-llama/Llama-2-7b-hf
def set_vocab(self):
try:
self._set_vocab_sentencepiece()
except FileNotFoundError:
self._set_vocab_builtin("llama-spm", self.hparams["vocab_size"])
def set_gguf_parameters(self):
n_embd = self._n_embd
head_dim = self.hparams["head_dim"]
rot_pct = 1.0
assert self.block_count == len(self._num_kv_heads)
assert self.block_count == len(self._num_query_heads)
assert self.block_count == len(self._ffn_dims)
self.gguf_writer.add_name(self.dir_model.name if self.model_name is None else self.model_name)
self.gguf_writer.add_block_count(self.block_count)
self.gguf_writer.add_context_length(self.hparams["max_context_length"])
self.gguf_writer.add_embedding_length(n_embd)
self.gguf_writer.add_feed_forward_length(self._ffn_dims)
self.gguf_writer.add_head_count(self._num_query_heads)
self.gguf_writer.add_head_count_kv(self._num_kv_heads)
self.gguf_writer.add_rope_freq_base(self.hparams["rope_freq_constant"])
# https://huggingface.co/apple/OpenELM-270M-Instruct/blob/c401df2/modeling_openelm.py#L30
self.gguf_writer.add_layer_norm_rms_eps(1e-6)
self.gguf_writer.add_rope_dimension_count(int(rot_pct * head_dim))
self.gguf_writer.add_key_length(head_dim)
self.gguf_writer.add_value_length(head_dim)
self.gguf_writer.add_file_type(self.ftype)
def find_hparam(self, keys: Iterable[str], optional: bool = False) -> Any:
if "n_layers" in keys:
return self.hparams["num_transformer_layers"]
return super().find_hparam(keys, optional)
def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
# split ff
if bid is not None and name == f"transformer.layers.{bid}.ffn.proj_1.weight":
ff_dim = self._ffn_dims[bid]
yield (self.format_tensor_name(gguf.MODEL_TENSOR.FFN_GATE, bid), data_torch[:ff_dim])
yield (self.format_tensor_name(gguf.MODEL_TENSOR.FFN_UP, bid), data_torch[ff_dim:])
return
yield (self.map_tensor_name(name), data_torch)
@Model.register("ArcticForCausalLM")
class ArcticModel(Model):
model_arch = gguf.MODEL_ARCH.ARCTIC
@@ -2853,11 +2942,17 @@ class DeepseekV2Model(Model):
raise ValueError(f"Unprocessed experts: {experts}")
@Model.register("T5ForConditionalGeneration")
@Model.register("T5WithLMHeadModel")
@Model.register("T5ForConditionalGeneration")
@Model.register("MT5ForConditionalGeneration")
@Model.register("UMT5ForConditionalGeneration")
class T5Model(Model):
model_arch = gguf.MODEL_ARCH.T5
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self.shared_token_embeddings_found = False
def set_vocab(self):
# to avoid TypeError: Descriptors cannot be created directly
# exception when importing sentencepiece_model_pb2
@@ -2865,17 +2960,29 @@ class T5Model(Model):
from sentencepiece import SentencePieceProcessor
from sentencepiece import sentencepiece_model_pb2 as model
tokenizer_path = self.dir_model / 'spiece.model'
tokenizer_path = self.dir_model / 'tokenizer.model'
# many older models use spiece.model tokenizer model filename
if not tokenizer_path.is_file():
tokenizer_path = self.dir_model / 'spiece.model'
if not tokenizer_path.is_file():
raise FileNotFoundError(f"File not found: {tokenizer_path}")
sentencepiece_model = model.ModelProto()
sentencepiece_model.ParseFromString(open(tokenizer_path, "rb").read())
# some models like Pile-T5 family use BPE tokenizer instead of Unigram
if sentencepiece_model.trainer_spec.model_type == 2: # BPE
# assure the tokenizer model file name is correct
assert tokenizer_path.name == 'tokenizer.model'
return self._set_vocab_sentencepiece()
else:
assert sentencepiece_model.trainer_spec.model_type == 1 # UNIGRAM
add_prefix = sentencepiece_model.normalizer_spec.add_dummy_prefix
remove_whitespaces = sentencepiece_model.normalizer_spec.remove_extra_whitespaces
precompiled_charsmap = sentencepiece_model.normalizer_spec.precompiled_charsmap
assert sentencepiece_model.trainer_spec.model_type == 1 # UNIGRAM
tokenizer = SentencePieceProcessor()
tokenizer.LoadFromFile(str(tokenizer_path))
@@ -2945,7 +3052,10 @@ class T5Model(Model):
def set_gguf_parameters(self):
self.gguf_writer.add_name("T5")
self.gguf_writer.add_context_length(self.hparams["n_positions"])
if (n_ctx := self.find_hparam(["n_positions"], optional=True)) is None:
logger.warning("Couldn't find context length in config.json, assuming default value of 512")
n_ctx = 512
self.gguf_writer.add_context_length(n_ctx)
self.gguf_writer.add_embedding_length(self.hparams["d_model"])
self.gguf_writer.add_feed_forward_length(self.hparams["d_ff"])
self.gguf_writer.add_block_count(self.hparams["num_layers"])
@@ -2961,12 +3071,17 @@ class T5Model(Model):
def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
del bid # unused
# Sometimes T5 and Flan-T5 based models contain "encoder.embed_tokens.weight" tensor or
# "decoder.embed_tokens.weight" tensors that are duplicates of "shared.weight" tensor
# To prevent errors caused by an unnecessary unmapped tensor, skip both of them and use only "shared.weight".
if name == "decoder.embed_tokens.weight" or name == "encoder.embed_tokens.weight":
logger.debug(f"Skipping tensor {name!r} in safetensors so that convert can end normally.")
return []
# T5 based models contain shared token embeddings tensors saved randomly as either "encoder.embed_tokens.weight",
# "decoder.embed_tokens.weight" or "shared.weight" tensor. In some models there are even multiple of them stored
# in the safetensors files. We use the first tensor from these three as the token embeddings for both encoder
# and decoder and ignore the remaining ones.
if name in ["decoder.embed_tokens.weight", "encoder.embed_tokens.weight", "shared.weight"]:
if not self.shared_token_embeddings_found:
name = "shared.weight"
self.shared_token_embeddings_found = True
else:
logger.debug(f"Skipping shared tensor {name!r} in safetensors so that convert can end normally.")
return []
return [(self.map_tensor_name(name), data_torch)]

View File

@@ -45,6 +45,7 @@ class TOKENIZER_TYPE(IntEnum):
SPM = auto()
BPE = auto()
WPM = auto()
UGM = auto()
# TODO: this string has to exercise as much pre-tokenizer functionality as possible
@@ -89,6 +90,7 @@ models = [
{"name": "gemma", "tokt": TOKENIZER_TYPE.SPM, "repo": "https://huggingface.co/google/gemma-2b", },
{"name": "gemma-2", "tokt": TOKENIZER_TYPE.SPM, "repo": "https://huggingface.co/google/gemma-2-9b", },
{"name": "jais", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/core42/jais-13b", },
{"name": "t5", "tokt": TOKENIZER_TYPE.UGM, "repo": "https://huggingface.co/google-t5/t5-small", },
]
@@ -110,9 +112,13 @@ def download_model(model):
os.makedirs(f"models/tokenizers/{name}", exist_ok=True)
files = ["config.json", "tokenizer.json", "tokenizer_config.json"]
if tokt == TOKENIZER_TYPE.SPM:
files.append("tokenizer.model")
if tokt == TOKENIZER_TYPE.UGM:
files.append("spiece.model")
for file in files:
save_path = f"models/tokenizers/{name}/{file}"
if os.path.isfile(save_path):
@@ -135,7 +141,7 @@ for model in models:
name = model["name"]
tokt = model["tokt"]
if tokt == TOKENIZER_TYPE.SPM:
if tokt == TOKENIZER_TYPE.SPM or tokt == TOKENIZER_TYPE.UGM:
continue
# Skip if the tokenizer folder does not exist or there are other download issues previously
@@ -145,7 +151,10 @@ for model in models:
# create the tokenizer
try:
tokenizer = AutoTokenizer.from_pretrained(f"models/tokenizers/{name}")
if name == "t5":
tokenizer = AutoTokenizer.from_pretrained(f"models/tokenizers/{name}", use_fast=False)
else:
tokenizer = AutoTokenizer.from_pretrained(f"models/tokenizers/{name}")
except OSError as e:
logger.error(f"Error loading tokenizer for model {name}. The model may not exist or is not accessible with the provided token. Error: {e}")
continue # Skip to the next model if the tokenizer can't be loaded
@@ -266,6 +275,7 @@ tests = [
"\n =",
"' era",
"Hello, y'all! How are you 😁 ?我想在apple工作1314151天",
"!!!!!!",
"3",
"33",
"333",
@@ -304,7 +314,10 @@ for model in models:
# create the tokenizer
try:
tokenizer = AutoTokenizer.from_pretrained(f"models/tokenizers/{name}")
if name == "t5":
tokenizer = AutoTokenizer.from_pretrained(f"models/tokenizers/{name}", use_fast=False)
else:
tokenizer = AutoTokenizer.from_pretrained(f"models/tokenizers/{name}")
except OSError as e:
logger.error(f"Failed to load tokenizer for model {name}. Error: {e}")
continue # Skip this model and continue with the next one in the loop

View File

@@ -93,14 +93,34 @@ int main(int argc, char ** argv) {
// create a llama_batch
// we use this object to submit token data for decoding
llama_batch batch = llama_batch_init(std::max(tokens_list.size(), (size_t)n_parallel), 0, 1);
llama_batch batch = llama_batch_init(std::max(tokens_list.size(), (size_t) n_parallel), 0, n_parallel);
std::vector<llama_seq_id> seq_ids(n_parallel, 0);
for (int32_t i = 0; i < n_parallel; ++i) {
seq_ids[i] = i;
}
// evaluate the initial prompt
for (size_t i = 0; i < tokens_list.size(); ++i) {
llama_batch_add(batch, tokens_list[i], i, { 0 }, false);
llama_batch_add(batch, tokens_list[i], i, seq_ids, false);
}
GGML_ASSERT(batch.n_tokens == (int) tokens_list.size());
if (llama_model_has_encoder(model)) {
if (llama_encode(ctx, batch)) {
LOG_TEE("%s : failed to eval\n", __func__);
return 1;
}
llama_token decoder_start_token_id = llama_model_decoder_start_token(model);
if (decoder_start_token_id == -1) {
decoder_start_token_id = llama_token_bos(model);
}
llama_batch_clear(batch);
llama_batch_add(batch, decoder_start_token_id, 0, seq_ids, false);
}
// llama_decode will output logits only for the last token of the prompt
batch.logits[batch.n_tokens - 1] = true;
@@ -109,11 +129,11 @@ int main(int argc, char ** argv) {
return 1;
}
// assign the system KV cache to all parallel sequences
// this way, the parallel sequences will "reuse" the prompt tokens without having to copy them
for (int32_t i = 1; i < n_parallel; ++i) {
llama_kv_cache_seq_cp(ctx, 0, i, -1, -1);
}
//// assign the system KV cache to all parallel sequences
//// this way, the parallel sequences will "reuse" the prompt tokens without having to copy them
//for (int32_t i = 1; i < n_parallel; ++i) {
// llama_kv_cache_seq_cp(ctx, 0, i, -1, -1);
//}
if (n_parallel > 1) {
LOG_TEE("\n\n%s: generating %d sequences ...\n", __func__, n_parallel);

View File

@@ -1,3 +1,3 @@
-r ../../requirements/requirements-convert-legacy-llama.txt
pillow~=10.2.0
torch~=2.1.1
torch~=2.2.1

View File

@@ -255,7 +255,9 @@ int main(int argc, char ** argv) {
}
const bool add_bos = llama_should_add_bos_token(model);
GGML_ASSERT(llama_add_eos_token(model) != 1);
if (!llama_model_has_encoder(model)) {
GGML_ASSERT(llama_add_eos_token(model) != 1);
}
LOG("add_bos: %d\n", add_bos);
std::vector<llama_token> embd_inp;
@@ -517,6 +519,24 @@ int main(int argc, char ** argv) {
exit(1);
}
if (llama_model_has_encoder(model)) {
int enc_input_size = embd_inp.size();
llama_token * enc_input_buf = embd_inp.data();
if (llama_encode(ctx, llama_batch_get_one(enc_input_buf, enc_input_size, 0, 0))) {
LOG_TEE("%s : failed to eval\n", __func__);
return 1;
}
llama_token decoder_start_token_id = llama_model_decoder_start_token(model);
if (decoder_start_token_id == -1) {
decoder_start_token_id = llama_token_bos(model);
}
embd_inp.clear();
embd_inp.push_back(decoder_start_token_id);
}
while ((n_remain != 0 && !is_antiprompt) || params.interactive) {
// predict
if (!embd.empty()) {

View File

@@ -160,6 +160,7 @@ class MODEL_ARCH(IntEnum):
COMMAND_R = auto()
DBRX = auto()
OLMO = auto()
OPENELM = auto()
ARCTIC = auto()
DEEPSEEK2 = auto()
BITNET = auto()
@@ -285,6 +286,7 @@ MODEL_ARCH_NAMES: dict[MODEL_ARCH, str] = {
MODEL_ARCH.COMMAND_R: "command-r",
MODEL_ARCH.DBRX: "dbrx",
MODEL_ARCH.OLMO: "olmo",
MODEL_ARCH.OPENELM: "openelm",
MODEL_ARCH.ARCTIC: "arctic",
MODEL_ARCH.DEEPSEEK2: "deepseek2",
MODEL_ARCH.BITNET: "bitnet",
@@ -861,6 +863,19 @@ MODEL_TENSORS: dict[MODEL_ARCH, list[MODEL_TENSOR]] = {
MODEL_TENSOR.FFN_DOWN,
MODEL_TENSOR.FFN_UP,
],
MODEL_ARCH.OPENELM: [
MODEL_TENSOR.TOKEN_EMBD,
MODEL_TENSOR.OUTPUT_NORM,
MODEL_TENSOR.ATTN_NORM,
MODEL_TENSOR.ATTN_QKV,
MODEL_TENSOR.ATTN_Q_NORM,
MODEL_TENSOR.ATTN_K_NORM,
MODEL_TENSOR.ATTN_OUT,
MODEL_TENSOR.FFN_NORM,
MODEL_TENSOR.FFN_GATE,
MODEL_TENSOR.FFN_DOWN,
MODEL_TENSOR.FFN_UP,
],
MODEL_ARCH.ARCTIC: [
MODEL_TENSOR.TOKEN_EMBD,
MODEL_TENSOR.OUTPUT_NORM,

View File

@@ -480,8 +480,11 @@ class GGUFWriter:
def add_leading_dense_block_count(self, length: int) -> None:
self.add_uint32(Keys.LLM.LEADING_DENSE_BLOCK_COUNT.format(arch=self.arch), length)
def add_feed_forward_length(self, length: int) -> None:
self.add_uint32(Keys.LLM.FEED_FORWARD_LENGTH.format(arch=self.arch), length)
def add_feed_forward_length(self, length: int | Sequence[int]) -> None:
if isinstance(length, int):
self.add_uint32(Keys.LLM.FEED_FORWARD_LENGTH.format(arch=self.arch), length)
else:
self.add_array(Keys.LLM.FEED_FORWARD_LENGTH.format(arch=self.arch), length)
def add_expert_feed_forward_length(self, length: int) -> None:
self.add_uint32(Keys.LLM.EXPERT_FEED_FORWARD_LENGTH.format(arch=self.arch), length)
@@ -495,11 +498,17 @@ class GGUFWriter:
def add_decoder_start_token_id(self, id: int) -> None:
self.add_uint32(Keys.LLM.DECODER_START_TOKEN_ID.format(arch=self.arch), id)
def add_head_count(self, count: int) -> None:
self.add_uint32(Keys.Attention.HEAD_COUNT.format(arch=self.arch), count)
def add_head_count(self, count: int | Sequence[int]) -> None:
if isinstance(count, int):
self.add_uint32(Keys.Attention.HEAD_COUNT.format(arch=self.arch), count)
else:
self.add_array(Keys.Attention.HEAD_COUNT.format(arch=self.arch), count)
def add_head_count_kv(self, count: int) -> None:
self.add_uint32(Keys.Attention.HEAD_COUNT_KV.format(arch=self.arch), count)
def add_head_count_kv(self, count: int | Sequence[int]) -> None:
if isinstance(count, int):
self.add_uint32(Keys.Attention.HEAD_COUNT_KV.format(arch=self.arch), count)
else:
self.add_array(Keys.Attention.HEAD_COUNT_KV.format(arch=self.arch), count)
def add_key_length(self, length: int) -> None:
self.add_uint32(Keys.Attention.KEY_LENGTH.format(arch=self.arch), length)

View File

@@ -24,6 +24,7 @@ class TensorNameMap:
"backbone.embedding", # mamba
"backbone.embeddings", # mamba-hf
"transformer.in_out_embed", # Grok
"transformer.token_embeddings", # openelm
"shared", # t5
),
@@ -37,6 +38,7 @@ class TensorNameMap:
"word_embeddings_layernorm", # bloom
"embeddings.LayerNorm", # bert
"emb_ln", # nomic-bert
"transformer.norm", # openelm
),
# Position embeddings
@@ -69,6 +71,7 @@ class TensorNameMap:
"model.norm_f", # mamba-qbert
"backbone.norm_f", # mamba
"transformer.rms_norm", # Grok
"transformer.norm", # openelm
),
# Rope frequencies
@@ -98,6 +101,7 @@ class TensorNameMap:
"backbone.layers.{bid}.norm", # mamba
"transformer.decoder_layer.{bid}.rms_norm", # Grok
"transformer.blocks.{bid}.norm_attn_norm.norm_1", # dbrx
"transformer.layers.{bid}.attn_norm", # openelm
),
# Attention norm 2
@@ -119,7 +123,8 @@ class TensorNameMap:
"h.{bid}.attn.c_attn", # gpt2
"transformer.h.{bid}.mixer.Wqkv", # phi2
"encoder.layers.{bid}.attn.Wqkv", # nomic-bert
"model.layers.{bid}.self_attn.qkv_proj" # phi3
"model.layers.{bid}.self_attn.qkv_proj", # phi3
"transformer.layers.{bid}.attn.qkv_proj", # openelm
),
# Attention query
@@ -177,6 +182,7 @@ class TensorNameMap:
"encoder.layers.{bid}.attn.out_proj", # nomic-bert
"transformer.decoder_layer.{bid}.multi_head_attention.linear", # Grok
"transformer.blocks.{bid}.norm_attn_norm.attn.out_proj", # dbrx
"transformer.layers.{bid}.attn.out_proj", # openelm
),
# Attention output norm
@@ -212,6 +218,7 @@ class TensorNameMap:
"h.{bid}.ln_2", # gpt2
"model.layers.{bid}.ffn_norm", # internlm2
"transformer.decoder_layer.{bid}.rms_norm_2", # Grok
"transformer.layers.{bid}.ffn_norm", # openelm
),
# Post feed-forward norm
@@ -327,6 +334,7 @@ class TensorNameMap:
"encoder.layers.{bid}.mlp.fc2", # nomic-bert
"model.layers.{bid}.mlp.c_proj", # starcoder2
"encoder.layer.{bid}.mlp.wo", # jina-bert-v2
"transformer.layers.{bid}.ffn.proj_2", # openelm
"model.layers.{bid}.residual_mlp.w2", # arctic
"encoder.layer.{bid}.mlp.down_layer", # jina-bert-v2
),
@@ -348,7 +356,8 @@ class TensorNameMap:
"model.layers.{bid}.self_attn.q_layernorm", # persimmon
"model.layers.{bid}.self_attn.q_norm", # cohere
"transformer.blocks.{bid}.attn.q_ln", # sea-lion
"encoder.layer.{bid}.attention.self.layer_norm_q" # jina-bert-v2
"encoder.layer.{bid}.attention.self.layer_norm_q", # jina-bert-v2
"transformer.layers.{bid}.attn.q_norm", # openelm
),
MODEL_TENSOR.ATTN_K_NORM: (
@@ -356,7 +365,8 @@ class TensorNameMap:
"model.layers.{bid}.self_attn.k_layernorm", # persimmon
"model.layers.{bid}.self_attn.k_norm", # cohere
"transformer.blocks.{bid}.attn.k_ln", # sea-lion
"encoder.layer.{bid}.attention.self.layer_norm_k" # jina-bert-v2
"encoder.layer.{bid}.attention.self.layer_norm_k", # jina-bert-v2
"transformer.layers.{bid}.attn.k_norm", # openelm
),
MODEL_TENSOR.ROPE_FREQS: (

View File

@@ -485,6 +485,13 @@ extern "C" {
// Get a llama model tensor
LLAMA_API struct ggml_tensor * llama_get_model_tensor(struct llama_model * model, const char * name);
// Returns true if the model contains an encoder that requires llama_encode() call
LLAMA_API bool llama_model_has_encoder(const struct llama_model * model);
// For encoder-decoder models, this function returns id of the token that must be provided
// to the decoder to start generating output sequence. For other models, it returns -1.
LLAMA_API llama_token llama_model_decoder_start_token(const struct llama_model * model);
// Returns 0 on success
LLAMA_API uint32_t llama_model_quantize(
const char * fname_inp,
@@ -770,6 +777,14 @@ extern "C" {
// Frees a batch of tokens allocated with llama_batch_init()
LLAMA_API void llama_batch_free(struct llama_batch batch);
// Processes a batch of tokens with the ecoder part of the encoder-decoder model.
// Stores the encoder output internally for later use by the decoder cross-attention layers.
// 0 - success
// < 0 - error
LLAMA_API int32_t llama_encode(
struct llama_context * ctx,
struct llama_batch batch);
// Positive return values does not mean a fatal error, but rather a warning.
// 0 - success
// 1 - could not find a KV slot for the batch (try reducing the size of the batch or increase the context)

View File

@@ -73,6 +73,8 @@ __ggml_vocab_test__
__ggml_vocab_test__
Hello, y'all! How are you 😁 ?我想在apple工作1314151天
__ggml_vocab_test__
!!!!!!
__ggml_vocab_test__
3
__ggml_vocab_test__
33

View File

@@ -31,6 +31,7 @@
1027
1005 3690
7592 1010 1061 1005 2035 999 2129 2024 2017 100 1029 1855 100 100 6207 100 100 14677 23632 22203 1811 1995
999 999 999 999 999 999
1017
3943
21211

View File

@@ -73,6 +73,8 @@ __ggml_vocab_test__
__ggml_vocab_test__
Hello, y'all! How are you 😁 ?我想在apple工作1314151天
__ggml_vocab_test__
!!!!!!
__ggml_vocab_test__
3
__ggml_vocab_test__
33

View File

@@ -31,6 +31,7 @@
206 1857
14 4515
28339 19 1770 14 1954 8 4070 1955 1933 80503 231 5691 12081 13336 2648 29325 14315 24 26 24 27 24 28 24 5123 18372
57178 10251
26
26 26
26 26 26

View File

@@ -73,6 +73,8 @@ __ggml_vocab_test__
__ggml_vocab_test__
Hello, y'all! How are you 😁 ?我想在apple工作1314151天
__ggml_vocab_test__
!!!!!!
__ggml_vocab_test__
3
__ggml_vocab_test__
33

View File

@@ -31,6 +31,7 @@
185 405
6 2895
17535 11 320 6 435 0 1717 417 340 12394 233 210 3015 19100 608 9413 2668 16 18 16 19 16 20 16 1393 169 121 239
15330 3023
18
18 18
18 18 18

View File

@@ -73,6 +73,8 @@ __ggml_vocab_test__
__ggml_vocab_test__
Hello, y'all! How are you 😁 ?我想在apple工作1314151天
__ggml_vocab_test__
!!!!!!
__ggml_vocab_test__
3
__ggml_vocab_test__
33

View File

@@ -31,6 +31,7 @@
185 403
6 2906
17464 11 320 6 436 0 1724 418 340 33701 210 3025 19017 612 9407 2681 16 18 16 19 16 20 16 1398 68940 239
15278 3033
18
18 18
18 18 18

View File

@@ -73,6 +73,8 @@ __ggml_vocab_test__
__ggml_vocab_test__
Hello, y'all! How are you 😁 ?我想在apple工作1314151天
__ggml_vocab_test__
!!!!!!
__ggml_vocab_test__
3
__ggml_vocab_test__
33

View File

@@ -31,6 +31,7 @@
1212 40
18 4932
9856 23 291 18 436 12 1265 362 299 8196 207 204 42 50087 123 2727 20300 32022 133 234 17419 30137 28 7858 181 133 236
51520
30
3138
22287

View File

@@ -73,6 +73,8 @@ __ggml_vocab_test__
__ggml_vocab_test__
Hello, y'all! How are you 😁 ?我想在apple工作1314151天
__ggml_vocab_test__
!!!!!!
__ggml_vocab_test__
3
__ggml_vocab_test__
33

View File

@@ -31,6 +31,7 @@
198 796
6 6980
15496 11 331 6 439 0 1374 389 345 30325 223 5633 22755 239 46349 111 28839 101 18040 32432 98 43291 1485 1415 24309 25465 171 121 252
13896 3228
18
2091
20370

View File

@@ -73,6 +73,8 @@ __ggml_vocab_test__
__ggml_vocab_test__
Hello, y'all! How are you 😁 ?我想在apple工作1314151天
__ggml_vocab_test__
!!!!!!
__ggml_vocab_test__
3
__ggml_vocab_test__
33

View File

@@ -31,6 +31,7 @@
198 284
6 11639
9906 11 379 65948 0 2650 527 499 27623 223 949 37046 101067 19000 23182 102301 9263 18136 16 36827 21909
17523 3001
18
1644
8765

View File

@@ -73,6 +73,8 @@ __ggml_vocab_test__
__ggml_vocab_test__
Hello, y'all! How are you 😁 ?我想在apple工作1314151天
__ggml_vocab_test__
!!!!!!
__ggml_vocab_test__
3
__ggml_vocab_test__
33

View File

@@ -31,6 +31,7 @@
29871 13 353
525 3152
15043 29892 343 29915 497 29991 1128 526 366 29871 243 162 155 132 1577 30672 31522 30505 11548 31041 30732 29896 29941 29896 29946 29896 29945 29896 30408 30739
1738 6824 21004
29871 29941
29871 29941 29941
29871 29941 29941 29941

View File

@@ -73,6 +73,8 @@ __ggml_vocab_test__
__ggml_vocab_test__
Hello, y'all! How are you 😁 ?我想在apple工作1314151天
__ggml_vocab_test__
!!!!!!
__ggml_vocab_test__
3
__ggml_vocab_test__
33

View File

@@ -31,6 +31,7 @@
187 426
8 8685
12092 13 340 8 455 2 1359 403 368 49042 212 3736 15367 41197 13610 19934 41869 21275 1012 1047 18795 40120 20422 241
18963 4672
20
1610
20084

View File

@@ -73,6 +73,8 @@ __ggml_vocab_test__
__ggml_vocab_test__
Hello, y'all! How are you 😁 ?我想在apple工作1314151天
__ggml_vocab_test__
!!!!!!
__ggml_vocab_test__
3
__ggml_vocab_test__
33

View File

@@ -31,6 +31,7 @@
29871 13 353
525 3152
15043 29892 343 29915 497 29991 1128 526 366 29871 243 162 155 132 1577 30672 31522 30505 11548 31041 30732 29896 29941 29896 29946 29896 29945 29896 30408 30739
1738 6824 21004
29871 29941
29871 29941 29941
29871 29941 29941 29941

View File

@@ -73,6 +73,8 @@ __ggml_vocab_test__
__ggml_vocab_test__
Hello, y'all! How are you 😁 ?我想在apple工作1314151天
__ggml_vocab_test__
!!!!!!
__ggml_vocab_test__
3
__ggml_vocab_test__
33

View File

@@ -31,6 +31,7 @@
198 284
6 11385
9707 11 379 64848 0 2585 525 498 26525 223 937 104100 18493 22377 99257 16 18 16 19 16 20 16 35727 21216
17085 2928
18
18 18
18 18 18

View File

@@ -73,6 +73,8 @@ __ggml_vocab_test__
__ggml_vocab_test__
Hello, y'all! How are you 😁 ?我想在apple工作1314151天
__ggml_vocab_test__
!!!!!!
__ggml_vocab_test__
3
__ggml_vocab_test__
33

View File

@@ -31,6 +31,7 @@
203 280
25 34666
8279 30 533 25 464 19 4971 884 844 18458 228 1018 4982 13368 2909 9513 17827 35 37 35 38 35 39 35 11873 47838
9163 3202
37
37 37
37 37 37

View File

@@ -73,6 +73,8 @@ __ggml_vocab_test__
__ggml_vocab_test__
Hello, y'all! How are you 😁 ?我想在apple工作1314151天
__ggml_vocab_test__
!!!!!!
__ggml_vocab_test__
3
__ggml_vocab_test__
33

View File

@@ -31,6 +31,7 @@
222 299
44 34719
8302 49 553 44 483 38 4998 904 863 18445 247 1037 4995 13379 2924 9515 17823 54 56 54 57 54 58 54 11904 47892
9221 3226
56
56 56
56 56 56

1197
poetry.lock generated Normal file

File diff suppressed because it is too large Load Diff

44
pyproject.toml Normal file
View File

@@ -0,0 +1,44 @@
[tool.poetry]
name = "llama-cpp-scripts"
version = "0.0.0"
description = "Scripts that ship with llama.cpp"
authors = ["GGML <ggml@ggml.ai>"]
readme = "README.md"
homepage = "https://ggml.ai"
repository = "https://github.com/ggerganov/llama.cpp"
keywords = ["ggml", "gguf", "llama.cpp"]
packages = [{ include = "*.py", from = "." }]
classifiers = [
"Programming Language :: Python :: 3",
"License :: OSI Approved :: MIT License",
"Operating System :: OS Independent",
]
[tool.poetry.dependencies]
python = ">=3.9"
numpy = "^1.25.0"
sentencepiece = ">=0.1.98,<0.2.0"
transformers = ">=4.35.2,<5.0.0"
protobuf = ">=4.21.0,<5.0.0"
gguf = { path = "./gguf-py" }
torch = { version = "^2.2.0", source = "pytorch" }
[tool.poetry.dev-dependencies]
pytest = "^5.2"
# Force wheel + cpu
# For discussion and context see https://github.com/python-poetry/poetry#6409
[[tool.poetry.source]]
name = "pytorch"
url = "https://download.pytorch.org/whl/cpu"
priority = "explicit"
[build-system]
requires = ["poetry-core>=1.0.0"]
build-backend = "poetry.core.masonry.api"
[tool.poetry.scripts]
llama-convert-hf-to-gguf = "convert_hf_to_gguf:main"
llama-convert-llama-ggml-to-gguf = "convert_llama_ggml_to_gguf:main"
llama-ggml-vk-generate-shaders = "ggml_vk_generate_shaders:main"

View File

@@ -6,6 +6,6 @@
-r ./requirements/requirements-convert-legacy-llama.txt
-r ./requirements/requirements-convert-hf-to-gguf.txt
-r ./requirements/requirements-convert-hf-to-gguf-update.txt
-r ./requirements/requirements-convert-llama-ggml-to-gguf.txt
-r ./requirements/requirements-convert_hf_to_gguf.txt
-r ./requirements/requirements-convert_hf_to_gguf_update.txt
-r ./requirements/requirements-convert_llama_ggml_to_gguf.txt

View File

@@ -167,11 +167,11 @@ if (( do_cleanup )); then
fi
check_convert_script examples/convert-legacy-llama.py
for py in convert-*.py; do
for py in convert_*.py; do
# skip convert-hf-to-gguf-update.py
# TODO: the check is failing for some reason:
# https://github.com/ggerganov/llama.cpp/actions/runs/8875330981/job/24364557177?pr=6920
[[ $py == convert-hf-to-gguf-update.py ]] && continue
[[ $py == convert_hf_to_gguf_update.py ]] && continue
check_convert_script "$py"
done

View File

@@ -1,7 +1,5 @@
# TODO: should not use this
if (WIN32)
add_compile_definitions(_CRT_SECURE_NO_WARNINGS)
if (BUILD_SHARED_LIBS)
set(CMAKE_WINDOWS_EXPORT_ALL_SYMBOLS ON)
endif()

File diff suppressed because it is too large Load Diff