Compare commits


33 Commits

Author SHA1 Message Date
Francis Couture-Harpin
75b3a09602 test-backend-ops : add TQ1_0 and TQ2_0 comments for later
Not yet adding uncommented, because some backends like SYCL and Metal
do not properly handle unknown types in supports_op for GGML_OP_MUL_MAT.
(and Metal also doesn't handle it with GGML_OP_GET_ROWS)
Support for TQ1_0 and TQ2_0 for backends other than CPU
will be added in follow-up pull requests.
2024-09-04 15:00:21 -04:00
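For context, a minimal sketch of the kind of supports_op guard this refers to; the function name is hypothetical and this is not the actual SYCL or Metal code. A backend must explicitly reject quant types it does not implement, otherwise the scheduler hands it a GGML_OP_MUL_MAT it cannot run:

static bool example_backend_supports_op(const struct ggml_tensor * op) {
    switch (op->op) {
        case GGML_OP_MUL_MAT:
            switch (op->src[0]->type) {
                case GGML_TYPE_F32:
                case GGML_TYPE_F16:
                case GGML_TYPE_TQ1_0:
                case GGML_TYPE_TQ2_0:
                    return true;
                default:
                    return false; // unknown types must not fall through to "supported"
            }
        default:
            return true;
    }
}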
Francis Couture-Harpin
8d61607656 ggml : remove unused ggml_mul special case
It would otherwise conflict with the more general
optimization coming with Mamba-2.

* ggml : handle TQ1_0 and TQ2_0 in dequantization-based operators
2024-09-04 13:50:08 -04:00
Francis Couture-Harpin
7f3a619c98 Merge branch 'master' into compilade/bitnet-ternary 2024-09-04 13:26:50 -04:00
Francis Couture-Harpin
cb6d9962c4 Merge branch 'master' into compilade/bitnet-ternary
2024-08-22 16:42:24 -04:00
Francis Couture-Harpin
35cc5567c8 ggml-quants : deduplicate TQ1_0 and TQ2_0 __ARM_FEATURE_DOTPROD support
2024-08-13 18:00:06 -04:00
Francis Couture-Harpin
82b240406d Merge branch 'master' into compilade/bitnet-ternary 2024-08-13 17:36:09 -04:00
Francis Couture-Harpin
69f772682e ggml-quants : allow using ARM dot product instructions for TQ1_0 2024-08-13 17:21:19 -04:00
Francis Couture-Harpin
895004f3f8 convert : allow direct conversion to TQ1_0 and TQ2_0
The token embeddings and output tensors are kept in F16
to allow quantizing them to Q4_K and Q6_K with llama-quantize.

* llama : handle fallback for TQ1_0 and TQ2_0 with Q4_0

Q4_0 is not completely symmetric (so not lossless for ternary models),
but it should be good enough.
2024-08-13 17:17:43 -04:00
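A rough sketch of the fallback described above (helper name hypothetical, not the actual llama-quantize code):

// Fall back to Q4_0 when a ternary type cannot be used for a tensor.
static enum ggml_type fallback_type(enum ggml_type wanted) {
    switch (wanted) {
        case GGML_TYPE_TQ1_0:
        case GGML_TYPE_TQ2_0:
            // Q4_0's value grid is not symmetric around zero, so {-1, 0, +1}
            // does not round-trip exactly, but the error is small.
            return GGML_TYPE_Q4_0;
        default:
            return wanted;
    }
}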
Francis Couture-Harpin
3a0bf17d57 gguf-py : Numpy (de)quantization for TQ1_0 and TQ2_0
* ggml-quants : use roundf instead of nearest_int for TQ1_0 and TQ2_0

This does not change anything for ternary models,
since their values should never end up being in halfway cases anyway.
2024-08-12 00:06:48 -04:00
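A small illustration of why the swap is safe here (hypothetical helper): roundf rounds halfway cases away from zero while ggml's nearest_int rounds to nearest even, but ternary inputs scale to values near -1, 0 and +1 and never land on a .5 boundary, so the two never disagree for these types.

#include <math.h>

// Quantize one scaled weight to {-1, 0, +1}; for ternary models
// x * inv_d is already close to an integer, so tie-breaking is moot.
static int ternary_round(float x, float inv_d) {
    return (int) roundf(x * inv_d);
}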
Francis Couture-Harpin
d911cd1f13 Merge branch 'master' into compilade/bitnet-ternary 2024-08-11 15:52:29 -04:00
Francis Couture-Harpin
96b3d411e0 ggml-quants : allow using vdotq_s32 in TQ2_0 vec_dot
Not yet tested on hardware which supports it,
might not work or might not even compile. But also it might.
It should make the performance better on recent ARM CPUs.

* ggml-quants : remove comment about possible format change of TQ2_0

Making it slightly more convenient for AVX512
but less convenient for everything else is not worth the trouble.
2024-08-07 15:08:41 -04:00
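A hedged sketch of what vdotq_s32 buys (illustrative, not the actual TQ2_0 kernel): with the __ARM_FEATURE_DOTPROD extension, sixteen int8 products are formed and accumulated into four int32 lanes by a single instruction, replacing a widening vmull/vmlal sequence.

#if defined(__ARM_FEATURE_DOTPROD)
#include <arm_neon.h>

static int32_t dot16_i8(const int8_t * a, const int8_t * b) {
    const int8x16_t va  = vld1q_s8(a);
    const int8x16_t vb  = vld1q_s8(b);
    const int32x4_t acc = vdotq_s32(vdupq_n_s32(0), va, vb);
    return vaddvq_s32(acc); // horizontal add of the four partial sums
}
#endif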
Francis Couture-Harpin
f034aa1bb1 ggml-quants : rename fields of TQ1_0 and TQ2_0 structs for consistency 2024-08-03 16:22:04 -04:00
Francis Couture-Harpin
04eec58112 ggml : remove q1_3 and q2_2
* llama : remove the separate scale tensors of BitNet b1.58

They won't be needed, since the remaining ternary quant types have
built-in scales.
2024-08-02 20:16:26 -04:00
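The "built-in scales" are per-block f16 scales stored inside each ternary block. A from-memory sketch of the layout (see ggml-common.h for the authoritative definition):

#include <stdint.h>

typedef uint16_t ggml_half; // IEEE f16 bits, as in ggml

// Approximate TQ2_0 block: 256 weights packed at 2 bits each, plus the
// block scale d, so no separate scale tensor is needed.
typedef struct {
    uint8_t   qs[256 / 4]; // packed ternary values
    ggml_half d;           // built-in per-block scale
} block_tq2_0_sketch;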
Francis Couture-Harpin
45719a2472 ggml : avoid directly using vmlal_high_s8, for 32-bit ARM compat
The compiler seems smart enough to use the same instruction
even when using vget_high_s8 instead.
2024-08-01 01:11:30 -04:00
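Sketched for illustration: vmlal_high_s8 exists only on AArch64, while the vget_high_s8 + vmlal_s8 spelling also compiles on 32-bit ARM and typically lowers to the same SMLAL2 instruction on 64-bit targets.

#include <arm_neon.h>

// Multiply-accumulate the high halves of two int8x16 vectors.
static int16x8_t mla_high_i8(int16x8_t acc, int8x16_t a, int8x16_t b) {
    return vmlal_s8(acc, vget_high_s8(a), vget_high_s8(b));
    // AArch64-only equivalent: vmlal_high_s8(acc, a, b);
}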
Francis Couture-Harpin
5417089aeb ggml : add NEON vec_dot implementation for TQ1_0 and TQ2_0
2024-07-31 23:35:04 -04:00
Francis Couture-Harpin
a6dd6994a5 ggml : fix build issues in certain environments 2024-07-31 23:14:36 -04:00
Francis Couture-Harpin
e9719576c4 ggml : also faster TQ1_0
Same optimization as for TQ2_0 by offsetting the sum instead of the weights.
This makes TQ1_0 almost as fast as Q8_0 on AVX2.
2024-07-31 00:08:48 -04:00
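The trick in scalar form (a sketch, not the AVX2 kernel): ternary weights are stored unsigned as {0, 1, 2} meaning {-1, 0, +1}, and since sum(q*(w-1)) == sum(q*w) - sum(q), the -1 offset can be applied once to the accumulated sum instead of to every weight.

static int32_t dot_ternary(const uint8_t * w, const int8_t * q, int n) {
    int32_t acc  = 0;
    int32_t qsum = 0; // in the real kernels this sum is cheap or precomputed
    for (int i = 0; i < n; i++) {
        acc  += q[i] * (int32_t) w[i]; // unsigned weights in {0, 1, 2}
        qsum += q[i];
    }
    return acc - qsum; // same result as signed weights in {-1, 0, +1}
}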
Francis Couture-Harpin
560873f337 ggml : even faster TQ2_0
2024-07-30 23:36:52 -04:00
Francis Couture-Harpin
77b8f84ae7 ggml : add TQ1_0 and TQ2_0 ternary quantization types 2024-07-30 18:33:15 -04:00
Francis Couture-Harpin
79a278e922 Merge branch 'master' into compilade/bitnet-ternary 2024-07-28 21:27:33 -04:00
Francis Couture-Harpin
dd3e62a703 ggml : add some informative comments in q1_3 vec_dot 2024-07-28 21:17:16 -04:00
Francis Couture-Harpin
8fbd59308b ggml-quants : attempt to fix Arm 32-bit support 2024-06-28 22:52:57 -04:00
Francis Couture-Harpin
ec50944bf6 ggml-quants : fix build failure on Windows 2024-06-28 20:41:13 -04:00
Francis Couture-Harpin
bfd2f21fb4 bitnet : replace 1.58b with b1.58, as in the paper 2024-06-28 20:38:12 -04:00
Francis Couture-Harpin
0996149911 convert-hf : allow converting the weird BitNet 1.3B
Its FFN size is 5460, which is not convenient.
The offending tensors are kept in F16,
which makes the final model 5.01 bpw.
2024-06-27 02:06:28 -04:00
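A sketch of the constraint behind "not convenient", assuming the 256-wide blocks used by the ternary and K-quant types:

#include <stdbool.h>
#include <stdint.h>

// 5460 = 21 * 256 + 84, so rows of that size cannot be split into
// whole 256-element blocks and the tensors stay in F16 instead.
static bool quantizable_blockwise(int64_t row_size) {
    return row_size % 256 == 0; // false for the BitNet 1.3B FFN size of 5460
}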
Francis Couture-Harpin
961e293833 convert-hf : simplify BitNet pre-quantization
This still results in the exact same tensor weights and scales,
but it reveals some weirdness in the current algorithm.
2024-06-27 02:06:28 -04:00
Francis Couture-Harpin
89dc3b254c ggml-quants : use ceiling division when quantizing q1_3 2024-06-27 02:06:28 -04:00
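The standard idiom presumably meant here: integer division truncates toward zero, while ceiling division rounds up so that a final partial block is still counted.

// Ceiling division for positive integers.
static int ceil_div(int a, int b) {
    return (a + b - 1) / b; // e.g. ceil_div(13, 5) == 3, not 2
}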
Francis Couture-Harpin
9465ec6e12 ggml-quants : ARM NEON vec_dot for q2_2 and q1_3 2024-06-27 02:06:28 -04:00
Francis Couture-Harpin
638ad52f87 ggml-quants : cleanup Q1_3 code formatting 2024-06-27 02:06:28 -04:00
Francis Couture-Harpin
ef1e345c85 ggml-quants : Q2_2 now faster than Q4_K with AVX2 2024-06-27 02:06:28 -04:00
Francis Couture-Harpin
48b73b8498 ggml-quants : subtract 1 when back in epi8
This makes the 1.625 bpw type go faster than q4_0. Still not the fastest.
2024-06-27 02:06:28 -04:00
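A hedged AVX2 sketch of the same offset idea (not the actual q1_3/q2_2 kernel): once the unsigned ternary digits {0, 1, 2} are unpacked into epi8 lanes, one subtract recovers the signed values {-1, 0, +1}.

#include <immintrin.h>

// Shift unpacked ternary digits back to signed values.
static __m256i ternary_to_signed_epi8(__m256i digits) {
    return _mm256_sub_epi8(digits, _mm256_set1_epi8(1));
}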
Francis Couture-Harpin
7ef4254a92 ggml-quants : faster 1.625 bpw AVX2 vec_dot
Not using a lookup table anymore makes it match q4_0 speed.

* gguf-py : fix formatting

* llama : remove spaces on empty line
2024-06-27 02:06:28 -04:00
Francis Couture-Harpin
bd807499f7 ggml-quants : 1.625 bpw ternary packing for BitNet 1.58b 2024-06-27 02:06:22 -04:00
10 changed files with 24 additions and 59 deletions

View File

@@ -857,7 +857,7 @@ jobs:
       run: |
         mkdir build
         cd build
-        cmake .. -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_CUDA=ON -DBUILD_SHARED_LIBS=ON -DGGML_RPC=ON
+        cmake .. -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_CUDA=ON -DBUILD_SHARED_LIBS=ON
-        cmake --build . --config Release -j $((${env:NUMBER_OF_PROCESSORS} - 1)) -t ggml
+        cmake --build . --config Release -j ${env:NUMBER_OF_PROCESSORS}

View File

@@ -32,8 +32,8 @@
     {
         "name": "arm64-windows-msvc", "hidden": true,
-        "architecture": { "value": "arm64", "strategy": "external" },
-        "toolset": { "value": "host=x64", "strategy": "external" },
+        "architecture": { "value": "arm64", "strategy": "external" },
+        "toolset": { "value": "host=x86_64", "strategy": "external" },
         "cacheVariables": {
             "CMAKE_TOOLCHAIN_FILE": "${sourceDir}/cmake/arm64-windows-msvc.cmake"
         }
@@ -41,8 +41,8 @@
     {
         "name": "arm64-windows-llvm", "hidden": true,
-        "architecture": { "value": "arm64", "strategy": "external" },
-        "toolset": { "value": "host=x64", "strategy": "external" },
+        "architecture": { "value": "arm64", "strategy": "external" },
+        "toolset": { "value": "host=x86_64", "strategy": "external" },
         "cacheVariables": {
             "CMAKE_TOOLCHAIN_FILE": "${sourceDir}/cmake/arm64-windows-llvm.cmake"
         }

View File

@@ -124,9 +124,6 @@ static std::string get_cpu_info() {
                 (LPBYTE)cpu_brand,
                 &cpu_brand_size) == ERROR_SUCCESS) {
             id.assign(cpu_brand, cpu_brand_size);
-            if (id.find('\0') != std::string::npos) {
-                id.resize(id.find('\0'));
-            }
         }
         RegCloseKey(hKey);
 #endif

View File

@@ -412,7 +412,6 @@ struct server_queue {
     // multi-task version of post()
     int post(std::vector<server_task> & tasks, bool front = false) {
        std::unique_lock<std::mutex> lock(mutex_tasks);
        for (auto & task : tasks) {
            if (task.id == -1) {
                task.id = id++;

View File

@@ -135,7 +135,6 @@ option(GGML_VULKAN "ggml: use Vulkan"
 option(GGML_VULKAN_CHECK_RESULTS "ggml: run Vulkan op checks" OFF)
 option(GGML_VULKAN_DEBUG "ggml: enable Vulkan debug output" OFF)
 option(GGML_VULKAN_MEMORY_DEBUG "ggml: enable Vulkan memory debug output" OFF)
-option(GGML_VULKAN_SHADER_DEBUG_INFO "ggml: enable Vulkan shader debug info" OFF)
 option(GGML_VULKAN_PERF "ggml: enable Vulkan perf output" OFF)
 option(GGML_VULKAN_VALIDATE "ggml: enable Vulkan validation" OFF)
 option(GGML_VULKAN_RUN_TESTS "ggml: run Vulkan tests" OFF)

View File

@@ -612,10 +612,6 @@ if (GGML_VULKAN)
         add_compile_definitions(GGML_VULKAN_MEMORY_DEBUG)
     endif()

-    if (GGML_VULKAN_SHADER_DEBUG_INFO)
-        add_compile_definitions(GGML_VULKAN_SHADER_DEBUG_INFO)
-    endif()
-
     if (GGML_VULKAN_PERF)
         add_compile_definitions(GGML_VULKAN_PERF)
     endif()

View File

@@ -1165,11 +1165,6 @@ static int ggml_backend_sched_backend_id_from_cur(ggml_backend_sched_t sched, st
         }
     }

-    if (tensor->buffer || (tensor->view_src && tensor->view_src->buffer)) {
-        // since the tensor is pre-allocated, it cannot be moved to another backend
-        GGML_ABORT("pre-allocated tensor in a backend that cannot run the operation");
-    }
-
     // graph input
     if (tensor->flags & GGML_TENSOR_FLAG_INPUT) {
         cur_backend_id = sched->n_backends - 1; // last backend (assumed CPU)
@@ -1649,7 +1644,7 @@ static void ggml_backend_sched_split_graph(ggml_backend_sched_t sched, struct gg
         sched->prev_leaf_backend_ids = tmp;
     }

-    int graph_size = MAX(graph->n_nodes, graph->n_leafs) + sched->n_splits*GGML_SCHED_MAX_SPLIT_INPUTS*2*sched->n_copies;
+    int graph_size = graph->n_nodes + sched->n_splits*GGML_SCHED_MAX_SPLIT_INPUTS*2;
     if (sched->graph.size < graph_size) {
         sched->graph.size = graph_size;
         sched->graph.nodes = realloc(sched->graph.nodes, graph_size * sizeof(struct ggml_tensor *));
@@ -1701,7 +1696,6 @@ static void ggml_backend_sched_split_graph(ggml_backend_sched_t sched, struct gg
             for (int c = 0; c < sched->n_copies; c++) {
                 struct ggml_tensor * input_cpy = tensor_id_copy(id, backend_id, c);
                 sched->leaf_backend_ids[graph_copy->n_leafs] = backend_id;
-                assert(graph_copy->size > graph_copy->n_leafs);
                 graph_copy->leafs[graph_copy->n_leafs++] = input_cpy;
             }
         }
@@ -1715,7 +1709,6 @@ static void ggml_backend_sched_split_graph(ggml_backend_sched_t sched, struct gg
             for (int c = 0; c < sched->n_copies; c++) {
                 struct ggml_tensor * input_cpy = tensor_id_copy(id, backend_id, c);
                 sched->leaf_backend_ids[graph_copy->n_leafs] = backend_id;
-                assert(graph_copy->size > graph_copy->n_leafs);
                 graph_copy->leafs[graph_copy->n_leafs++] = input_cpy;
             }
         }
@@ -1726,7 +1719,6 @@ static void ggml_backend_sched_split_graph(ggml_backend_sched_t sched, struct gg
     for (int i = 0; i < graph->n_leafs; i++) {
         struct ggml_tensor * leaf = graph->leafs[i];
         sched->leaf_backend_ids[graph_copy->n_leafs] = tensor_backend_id(leaf);
-        assert(graph_copy->size > graph_copy->n_leafs);
         graph_copy->leafs[graph_copy->n_leafs++] = leaf;
     }
 }

View File

@@ -2572,15 +2572,8 @@ GGML_CALL static enum ggml_status ggml_backend_cuda_graph_compute(ggml_backend_t
                 cuda_ctx->cuda_graph->updated_kernel_arg.push_back((char **) &(node->src[1]->data));
                 // store a pointer to each copy op CUDA kernel to identify it later
                 void * ptr = ggml_cuda_cpy_fn(node->src[0], node->src[1]);
-                if (!ptr) {
-                    use_cuda_graph = false;
-#ifndef NDEBUG
-                    GGML_CUDA_LOG_WARN("%s: disabling CUDA graphs due to unsupported copy op\n", __func__);
-#endif
-                } else {
-                    if (std::find(ggml_cuda_cpy_fn_ptrs.begin(), ggml_cuda_cpy_fn_ptrs.end(), ptr) == ggml_cuda_cpy_fn_ptrs.end()) {
-                        ggml_cuda_cpy_fn_ptrs.push_back(ptr);
-                    }
-                }
+                if (std::find(ggml_cuda_cpy_fn_ptrs.begin(), ggml_cuda_cpy_fn_ptrs.end(), ptr) == ggml_cuda_cpy_fn_ptrs.end()) {
+                    ggml_cuda_cpy_fn_ptrs.push_back(ptr);
+                }
@@ -2849,9 +2842,6 @@ GGML_CALL static bool ggml_backend_cuda_supports_op(ggml_backend_t backend, cons
                 if (src0_type == GGML_TYPE_F16 && src1_type == GGML_TYPE_F32) {
                     return true;
                 }
-                if (src0_type == src1_type && ggml_is_contiguous(op->src[0]) && ggml_is_contiguous(op->src[1])) {
-                    return true;
-                }
                 return false;
             } break;
         case GGML_OP_DUP:

View File

@@ -428,10 +428,7 @@ void ggml_cuda_cpy(ggml_backend_cuda_context & ctx, const ggml_tensor * src0, gg
     char * src0_ddc = (char *) src0->data;
     char * src1_ddc = (char *) src1->data;

-    if (src0->type == src1->type && ggml_is_contiguous(src0) && ggml_is_contiguous(src1)) {
-        GGML_ASSERT(ggml_nbytes(src0) == ggml_nbytes(src1));
-        CUDA_CHECK(cudaMemcpyAsync(src1_ddc, src0_ddc, ggml_nbytes(src0), cudaMemcpyDeviceToDevice, main_stream));
-    } else if (src0->type == GGML_TYPE_F32 && src1->type == GGML_TYPE_F32) {
+    if (src0->type == GGML_TYPE_F32 && src1->type == GGML_TYPE_F32) {
         ggml_cpy_f32_f32_cuda (src0_ddc, src1_ddc, ne, ne00, ne01, ne02, nb00, nb01, nb02, nb03, ne10, ne11, ne12, nb10, nb11, nb12, nb13, main_stream);
     } else if (src0->type == GGML_TYPE_F32 && src1->type == GGML_TYPE_F16) {
         ggml_cpy_f32_f16_cuda (src0_ddc, src1_ddc, ne, ne00, ne01, ne02, nb00, nb01, nb02, nb03, ne10, ne11, ne12, nb10, nb11, nb12, nb13, main_stream);
@@ -452,8 +449,9 @@ void ggml_cuda_cpy(ggml_backend_cuda_context & ctx, const ggml_tensor * src0, gg
     } else if (src0->type == GGML_TYPE_F16 && src1->type == GGML_TYPE_F32) {
         ggml_cpy_f16_f32_cuda (src0_ddc, src1_ddc, ne, ne00, ne01, ne02, nb00, nb01, nb02, nb03, ne10, ne11, ne12, nb10, nb11, nb12, nb13, main_stream);
     } else {
-        GGML_ABORT("%s: unsupported type combination (%s to %s)\n", __func__,
-            ggml_type_name(src0->type), ggml_type_name(src1->type));
+        fprintf(stderr, "%s: unsupported type combination (%s to %s)\n", __func__,
+            ggml_type_name(src0->type), ggml_type_name(src1->type));
+        GGML_ABORT("fatal error");
     }
 }
@@ -463,30 +461,29 @@ void ggml_cuda_dup(ggml_backend_cuda_context & ctx, ggml_tensor * dst) {
 }

 void* ggml_cuda_cpy_fn(const ggml_tensor * src0, ggml_tensor * src1) {
-    if (src0->type == src1->type && ggml_is_contiguous(src0) && ggml_is_contiguous(src1)) {
-        return nullptr;
-    } else if (src0->type == GGML_TYPE_F32 && src1->type == GGML_TYPE_F32) {
-        return (void*) cpy_f32_f16<cpy_1_f32_f32>;
+    if (src0->type == GGML_TYPE_F32 && src1->type == GGML_TYPE_F32) {
+        return (void*) cpy_f32_f16<cpy_1_f32_f32>;
     } else if (src0->type == GGML_TYPE_F32 && src1->type == GGML_TYPE_F16) {
-        return (void*) cpy_f32_f16<cpy_1_f32_f16>;
+        return (void*) cpy_f32_f16<cpy_1_f32_f16>;
     } else if (src0->type == GGML_TYPE_F32 && src1->type == GGML_TYPE_Q8_0) {
-        return (void*) cpy_f32_q<cpy_blck_f32_q8_0, QK8_0>;
+        return (void*) cpy_f32_q<cpy_blck_f32_q8_0, QK8_0>;
     } else if (src0->type == GGML_TYPE_F32 && src1->type == GGML_TYPE_Q4_0) {
-        return (void*) cpy_f32_q<cpy_blck_f32_q4_0, QK4_0>;
+        return (void*) cpy_f32_q<cpy_blck_f32_q4_0, QK4_0>;
     } else if (src0->type == GGML_TYPE_F32 && src1->type == GGML_TYPE_Q4_1) {
-        return (void*) cpy_f32_q<cpy_blck_f32_q4_1, QK4_1>;
+        return (void*) cpy_f32_q<cpy_blck_f32_q4_1, QK4_1>;
     } else if (src0->type == GGML_TYPE_F32 && src1->type == GGML_TYPE_Q5_0) {
-        return (void*) cpy_f32_q<cpy_blck_f32_q5_0, QK5_0>;
+        return (void*) cpy_f32_q<cpy_blck_f32_q5_0, QK5_0>;
     } else if (src0->type == GGML_TYPE_F32 && src1->type == GGML_TYPE_IQ4_NL) {
-        return (void*) cpy_f32_q<cpy_blck_f32_iq4_nl, QK4_NL>;
+        return (void*) cpy_f32_q<cpy_blck_f32_iq4_nl, QK4_NL>;
     } else if (src0->type == GGML_TYPE_F32 && src1->type == GGML_TYPE_Q5_1) {
-        return (void*) cpy_f32_q<cpy_blck_f32_q5_1, QK5_1>;
+        return (void*) cpy_f32_q<cpy_blck_f32_q5_1, QK5_1>;
     } else if (src0->type == GGML_TYPE_F16 && src1->type == GGML_TYPE_F16) {
-        return (void*) cpy_f32_f16<cpy_1_f32_f16>;
+        return (void*) cpy_f32_f16<cpy_1_f32_f16>;
     } else if (src0->type == GGML_TYPE_F16 && src1->type == GGML_TYPE_F32) {
-        return (void*) cpy_f32_f16<cpy_1_f16_f32>;
+        return (void*) cpy_f32_f16<cpy_1_f16_f32>;
     } else {
-        GGML_ABORT("%s: unsupported type combination (%s to %s)\n", __func__,
-            ggml_type_name(src0->type), ggml_type_name(src1->type));
+        fprintf(stderr, "%s: unsupported type combination (%s to %s)\n", __func__,
+            ggml_type_name(src0->type), ggml_type_name(src1->type));
+        GGML_ABORT("fatal error");
     }
 }

View File

@@ -200,11 +200,6 @@ void string_to_spv(const std::string& _name, const std::string& in_fname, const
 #else
     std::vector<std::string> cmd = {GLSLC, "-fshader-stage=compute", "--target-env=vulkan1.2", "-O", in_path, "-o", out_fname};
 #endif
-
-    #ifdef GGML_VULKAN_SHADER_DEBUG_INFO
-    cmd.push_back("-g");
-    #endif
-
     for (const auto& define : defines) {
         cmd.push_back("-D" + define.first + "=" + define.second);
     }