mirror of https://github.com/ggerganov/llama.cpp.git (synced 2026-02-05 13:53:23 +02:00)
* server: introduce self-speculative decoding
* server: moved self-call into speculative.cpp
* can_speculate() includes self-speculation

  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* server: can_speculate() tests self-spec
* server: replace can_speculate() with slot.can_speculate()

  Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* common: use %zu format specifier for size_t in logging

  Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* server: can_speculate() requires a task instance
* common: ngram map, config self-speculative decoding
* common: add enum common_speculative_type
* common: add vector of speculative states
* common: add option --spec-draftless
* server: cleanup (remove slot.batch_spec, rename)
* common: moved self-spec impl to ngram-map
* common: cleanup (use common_speculative_state_draft)
* spec : refactor
* cont : naming
* spec: remove --spec-config
* doc: (draftless) speculative decoding
* common: print performance in spec decoding
* minor : cleanup
* common : better names
* minor : cleanup + fix build
* minor: comments
* CODEOWNERS: add common/ngram-map.* (#18471)
* common : rename speculative.draftless_type -> speculative.type
* ngram-map : fix uninitialized values
* ngram-map : take into account the input can become shorter
* ngram-map : revert len check for now
* arg : change `--spec-draftless` -> `--spec-type`
* spec : add common_speculative_state::accept()
* spec : refactor + add common_speculative_begin()
* spec : fix begin() call with mtmd
* spec : additional refactor + remove common_speculative_params

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
41 lines · 1.1 KiB · C++
#include "arg.h"
|
|
#include "common.h"
|
|
#include "ngram-cache.h"
|
|
#include "llama.h"
|
|
|
|
#include <string>
|
|
#include <vector>
|
|
|
|
int main(int argc, char ** argv){
|
|
common_params params;
|
|
|
|
if (!common_params_parse(argc, argv, params, LLAMA_EXAMPLE_LOOKUP)) {
|
|
return 1;
|
|
}
|
|
|
|
// init llama.cpp
|
|
llama_backend_init();
|
|
llama_numa_init(params.numa);
|
|
|
|
// load the model
|
|
auto llama_init = common_init_from_params(params);
|
|
|
|
auto * model = llama_init->model();
|
|
auto * ctx = llama_init->context();
|
|
|
|
GGML_ASSERT(model != nullptr);
|
|
|
|
// tokenize the prompt
|
|
std::vector<llama_token> inp;
|
|
inp = common_tokenize(ctx, params.prompt, true, true);
|
|
fprintf(stderr, "%s: tokenization done\n", __func__);
|
|
|
|
common_ngram_cache ngram_cache;
|
|
common_ngram_cache_update(ngram_cache, LLAMA_NGRAM_STATIC, LLAMA_NGRAM_STATIC, inp, inp.size(), true);
|
|
fprintf(stderr, "%s: hashing done, writing file to %s\n", __func__, params.speculative.lookup_cache_static.c_str());
|
|
|
|
common_ngram_cache_save(ngram_cache, params.speculative.lookup_cache_static);
|
|
|
|
return 0;
|
|
}
|
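For context: this file tokenizes the given prompt/corpus, hashes its n-grams into a static ngram cache, and writes that cache to the static lookup cache path. The sketch below illustrates how such a saved cache could later be loaded and queried for draft tokens on the consumer side. It is a minimal sketch, assuming the common_ngram_cache_load / common_ngram_cache_draft declarations and the LLAMA_NGRAM_MIN / LLAMA_NGRAM_MAX constants from common/ngram-cache.h; the exact signatures and the draft-seeding convention are assumptions, not the confirmed API.

// Minimal sketch (not part of this file): load a cache produced by lookup-create
// and ask it for draft tokens. Signatures are assumed from common/ngram-cache.h.
#include "ngram-cache.h"
#include "llama.h"

#include <string>
#include <vector>

static std::vector<llama_token> draft_from_static_cache(
        std::string cache_path, std::vector<llama_token> & context_tokens, int n_draft) {
    // load the static cache written by common_ngram_cache_save() above (assumed API)
    common_ngram_cache ngram_cache_static = common_ngram_cache_load(cache_path);

    // context and dynamic caches are left empty in this sketch
    common_ngram_cache ngram_cache_context;
    common_ngram_cache ngram_cache_dynamic;

    // assumption: the draft vector is seeded with the most recent token and
    // common_ngram_cache_draft() appends up to n_draft continuation tokens to it
    std::vector<llama_token> draft = { context_tokens.back() };

    common_ngram_cache_draft(context_tokens, draft, n_draft,
                             LLAMA_NGRAM_MIN, LLAMA_NGRAM_MAX,
                             ngram_cache_context, ngram_cache_dynamic, ngram_cache_static);

    return draft;
}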