scipts : add LICENSE and gen-authors.sh to sync

authors : update
license : add AUTHORS
2026-04-23 16:37:33 +03:00 · 2024-04-09 09:19:33 +03:00 · 2024-04-09 09:17:25 +03:00 · 2024-03-31 10:37:39 +03:00 · 2024-03-30 12:36:07 +02:00 · 2024-03-29 22:59:56 +01:00
31 changed files with 27553 additions and 3744 deletions
--- a/.devops/llama-cpp-clblast.srpm.spec
+++ b/.devops/llama-cpp-clblast.srpm.spec
@@ -1,5 +1,5 @@
 # SRPM for building from source and packaging an RPM for RPM-based distros.
-# https://fedoraproject.org/wiki/How_to_create_an_RPM_package
+# https://docs.fedoraproject.org/en-US/quick-docs/creating-rpm-packages
 # Built and maintained by John Boero - boeroboy@gmail.com
 # In honor of Seth Vidal https://www.redhat.com/it/blog/thank-you-seth-vidal

--- a/.devops/llama-cpp-cuda.srpm.spec
+++ b/.devops/llama-cpp-cuda.srpm.spec
@@ -1,5 +1,5 @@
 # SRPM for building from source and packaging an RPM for RPM-based distros.
-# https://fedoraproject.org/wiki/How_to_create_an_RPM_package
+# https://docs.fedoraproject.org/en-US/quick-docs/creating-rpm-packages
 # Built and maintained by John Boero - boeroboy@gmail.com
 # In honor of Seth Vidal https://www.redhat.com/it/blog/thank-you-seth-vidal

--- a/.devops/llama-cpp.srpm.spec
+++ b/.devops/llama-cpp.srpm.spec
@@ -1,5 +1,5 @@
 # SRPM for building from source and packaging an RPM for RPM-based distros.
-# https://fedoraproject.org/wiki/How_to_create_an_RPM_package
+# https://docs.fedoraproject.org/en-US/quick-docs/creating-rpm-packages
 # Built and maintained by John Boero - boeroboy@gmail.com
 # In honor of Seth Vidal https://www.redhat.com/it/blog/thank-you-seth-vidal

--- a/.github/workflows/bench.yml
+++ b/.github/workflows/bench.yml
@@ -25,7 +25,7 @@ on:
    branches:
      - master
    paths: ['.github/workflows/bench.yml', '**/CMakeLists.txt', '**/Makefile', '**/*.h', '**/*.hpp', '**/*.c', '**/*.cpp', '**/*.cu', '**/*.swift', '**/*.m', 'examples/server/bench/**.*']
-  pull_request:
+  pull_request_target:
    types: [opened, synchronize, reopened]
    paths: ['.github/workflows/bench.yml', '**/CMakeLists.txt', '**/Makefile', '**/*.h', '**/*.hpp', '**/*.c', '**/*.cpp', '**/*.cu', '**/*.swift', '**/*.m', 'examples/server/bench/**.*']
  schedule:
@@ -42,7 +42,7 @@ jobs:
      RUNNER_LABEL: Standard_NC4as_T4_v3 # FIXME Do not find a way to not duplicate it
      N_USERS: 8
      DURATION: 10m
-    if: ${{ github.event.inputs.gpu-series == 'Standard_NC4as_T4_v3' || github.event.schedule || github.event.pull_request || github.event.push.ref == 'refs/heads/master' }}
+    if: ${{ github.event.inputs.gpu-series == 'Standard_NC4as_T4_v3' || github.event.schedule || github.event.pull_request || github.head_ref == 'master' || github.ref_name == 'master' || github.event.push.ref == 'refs/heads/master' }}
    steps:
      - name: Clone
        id: checkout
--- a/655
+++ b/655
@@ -0,0 +1,655 @@
+# date: Tue Apr  9 09:17:14 EEST 2024
+# this file is auto-generated by scripts/gen-authors.sh
+
+0cc4m <picard12@live.de>
+0xspringtime <110655352+0xspringtime@users.noreply.github.com>
+2f38b454 <dxf@protonmail.com>
+3ooabkhxtn <31479382+3ooabkhxtn@users.noreply.github.com>
+44670 <44670@users.noreply.github.com>
+AN Long <aisk@users.noreply.github.com>
+AT <manyoso@users.noreply.github.com>
+Aarni Koskela <akx@iki.fi>
+Aaron Miller <apage43@ninjawhale.com>
+Aaryaman Vasishta <aaryaman.vasishta@amd.com>
+Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>
+Abhishek Gopinath K <31348521+overtunned@users.noreply.github.com>
+Adithya Balaji <adithya.b94@gmail.com>
+AdithyanI <adithyan.i4internet@gmail.com>
+Adrian <smith.adriane@gmail.com>
+Adrian Hesketh <a-h@users.noreply.github.com>
+AidanBeltonS <87009434+AidanBeltonS@users.noreply.github.com>
+Aisuko <urakiny@gmail.com>
+Alberto <57916483+albbus-stack@users.noreply.github.com>
+Alex <awhill19@icloud.com>
+Alex Azarov <alex@azarov.by>
+Alex Azarov <alexander.azarov@mapbox.com>
+Alex Klinkhamer <from.github.com.917@grencez.dev>
+Alex Klinkhamer <git@grencez.dev>
+Alex Nguyen <tiendung@users.noreply.github.com>
+Alex Petenchea <alex.petenchea@gmail.com>
+Alex Renda <alexrenda@users.noreply.github.com>
+Alex von Gluck IV <kallisti5@unixzen.com>
+Alexey Parfenov <zxed@alkatrazstudio.net>
+Ali Chraghi <63465728+alichraghi@users.noreply.github.com>
+Ali Nehzat <ali.nehzat@thanks.dev>
+Ali Tariq <ali.tariq@10xengineers.ai>
+Alon <alonfaraj@gmail.com>
+AlpinDale <52078762+AlpinDale@users.noreply.github.com>
+AmirAli Mirian <37371367+amiralimi@users.noreply.github.com>
+Ananta Bastola <anantarajbastola@gmail.com>
+Anas Ahouzi <112881240+aahouzi@users.noreply.github.com>
+András Salamon <ott2@users.noreply.github.com>
+Andrei <abetlen@gmail.com>
+Andrew Canis <andrew.canis@gmail.com>
+Andrew Duffy <a10y@users.noreply.github.com>
+Andrew Godfrey <AndrewGodfrey@users.noreply.github.com>
+Arik Poznanski <arikpoz@users.noreply.github.com>
+Artem <guinmoon@gmail.com>
+Artyom Lebedev <vagran.ast@gmail.com>
+Asbjørn Olling <asbjornolling@gmail.com>
+Ásgeir Bjarni Ingvarsson <asgeir@fundinn.org>
+Ashok Gelal <401055+ashokgelal@users.noreply.github.com>
+Ashraful Islam <ashraful.meche@gmail.com>
+Atsushi Tatsuma <yoshoku@outlook.com>
+Austin <77757836+teleprint-me@users.noreply.github.com>
+AustinMroz <austinmroz@utexas.edu>
+BADR <contact@pythops.com>
+Bach Le <bach@bullno1.com>
+Bailey Chittle <39804642+bachittle@users.noreply.github.com>
+BarfingLemurs <128182951+BarfingLemurs@users.noreply.github.com>
+Behnam M <58621210+ibehnam@users.noreply.github.com>
+Ben Garney <bengarney@users.noreply.github.com>
+Ben Siraphob <bensiraphob@gmail.com>
+Ben Williams <ben@719ben.com>
+Benjamin Lecaillon <84293038+blecaillon@users.noreply.github.com>
+Bernat Vadell <hounter.caza@gmail.com>
+Bodo Graumann <mail@bodograumann.de>
+Bono Lv <lvscar@users.noreply.github.com>
+Borislav Stanimirov <b.stanimirov@abv.bg>
+Branden Butler <bwtbutler@hotmail.com>
+Brian <mofosyne@gmail.com>
+Bruce MacDonald <brucewmacdonald@gmail.com>
+CJ Pais <cj@cjpais.com>
+CRD716 <crd716@gmail.com>
+Cameron <csteele@steelecameron.com>
+Cameron Kaiser <classilla@users.noreply.github.com>
+Casey Primozic <casey@cprimozic.net>
+Casey Primozic <me@ameo.link>
+CausalLM <148736309+CausalLM@users.noreply.github.com>
+Cebtenzzre <cebtenzzre@gmail.com>
+Chad Brewbaker <crb002@gmail.com>
+Cheng Shao <terrorjack@type.dance>
+Chris Kuehl <ckuehl@ckuehl.me>
+Christian Demsar <christian@github.email.demsar.us>
+Christian Demsar <crasm@git.vczf.us>
+Christian Falch <875252+chrfalch@users.noreply.github.com>
+Christian Kögler <ck3d@gmx.de>
+Clark Saben <76020733+csaben@users.noreply.github.com>
+Clint Herron <hanclinto@gmail.com>
+Cuong Trinh Manh <nguoithichkhampha@gmail.com>
+DAN™ <dranger003@gmail.com>
+Damian Stewart <d@damianstewart.com>
+Dane Madsen <dane_madsen@hotmail.com>
+DaniAndTheWeb <57776841+DaniAndTheWeb@users.noreply.github.com>
+Daniel Bevenius <daniel.bevenius@gmail.com>
+Daniel Drake <drake@endlessos.org>
+Daniel Hiltgen <dhiltgen@users.noreply.github.com>
+Daniel Illescas Romero <illescas.daniel@protonmail.com>
+DannyDaemonic <DannyDaemonic@gmail.com>
+Dat Quoc Nguyen <2412555+datquocnguyen@users.noreply.github.com>
+Dave Della Costa <ddellacosta+github@gmail.com>
+David Friehs <david@friehs.info>
+David Kennedy <dakennedyd@gmail.com>
+David Pflug <david@pflug.email>
+David Renshaw <dwrenshaw@gmail.com>
+David Sommers <12738+databyte@users.noreply.github.com>
+David Yang <davidyang6us@gmail.com>
+Dawid Wysocki <62249621+TortillaZHawaii@users.noreply.github.com>
+Dean <Dean.Sinaean@gmail.com>
+Deins <deinsegle@gmail.com>
+Didzis Gosko <didzis@users.noreply.github.com>
+Don Mahurin <dmahurin@users.noreply.github.com>
+DooWoong Lee (David) <manics99@naver.com>
+Doomsdayrs <38189170+Doomsdayrs@users.noreply.github.com>
+Douglas Hanley <thesecretaryofwar@gmail.com>
+Dr. Tom Murphy VII Ph.D <499244+tom7@users.noreply.github.com>
+Ebey Abraham <ebey97@gmail.com>
+Ed Lee <edilee@mozilla.com>
+Ed Lepedus <ed.lepedus@googlemail.com>
+Edward Taylor <edeetee@gmail.com>
+Elbios <141279586+Elbios@users.noreply.github.com>
+Engininja2 <139037756+Engininja2@users.noreply.github.com>
+Equim <sayaka@ekyu.moe>
+Eric Sommerlade <es0m@users.noreply.github.com>
+Eric Zhang <34133756+EZForever@users.noreply.github.com>
+Erik Garrison <erik.garrison@gmail.com>
+Erik Scholz <Green-Sky@users.noreply.github.com>
+Ettore Di Giacinto <mudler@users.noreply.github.com>
+Evan Jones <evan.q.jones@gmail.com>
+Evan Miller <emmiller@gmail.com>
+Eve <139727413+netrunnereve@users.noreply.github.com>
+Evgeny Kurnevsky <kurnevsky@gmail.com>
+Ewout ter Hoeven <E.M.terHoeven@student.tudelft.nl>
+ExtReMLapin <3909752+ExtReMLapin@users.noreply.github.com>
+FK <sozforex@gmail.com>
+Fabian <cmdrf@users.noreply.github.com>
+Fabio R. Sluzala <Fabio3rs@users.noreply.github.com>
+Faez Shakil <faez.shakil@gmail.com>
+FantasyGmm <16450052+FantasyGmm@users.noreply.github.com>
+Fattire <528174+fat-tire@users.noreply.github.com>
+Felix <stenbackfelix@gmail.com>
+Finn Voorhees <finnvoorhees@gmail.com>
+Firat <firatkiral@gmail.com>
+Folko-Ven <71110216+Folko-Ven@users.noreply.github.com>
+Foul-Tarnished <107711110+Foul-Tarnished@users.noreply.github.com>
+Francisco Melo <43780565+francis2tm@users.noreply.github.com>
+FrankHB <frankhb1989@gmail.com>
+Frederik Vogel <Schaltfehler@users.noreply.github.com>
+Gabe Goodhart <gabe.l.hart@gmail.com>
+GainLee <perfecter.gen@gmail.com>
+Galunid <karolek1231456@gmail.com>
+Gary Linscott <glinscott@gmail.com>
+Gary Mulder <gjmulder@gmail.com>
+Genkagaku.GPT <hlhr202@163.com>
+Georgi Gerganov <ggerganov@gmail.com>
+Gilad S <giladgd@users.noreply.github.com>
+GiviMAD <GiviMAD@users.noreply.github.com>
+Govlzkoy <gotope@users.noreply.github.com>
+Guillaume "Vermeille" Sanchez <Guillaume.V.Sanchez@gmail.com>
+Guillaume Wenzek <gwenzek@users.noreply.github.com>
+Guoteng <32697156+SolenoidWGT@users.noreply.github.com>
+Gustavo Rocha Dias <91472747+gustrd@users.noreply.github.com>
+Halalaluyafail3 <55773281+Halalaluyafail3@users.noreply.github.com>
+Haohui Mai <ricetons@gmail.com>
+Haoxiang Fei <tonyfettes@tonyfettes.com>
+Harald Fernengel <harald.fernengel@here.com>
+Hatsune Miku <129688334+at8u@users.noreply.github.com>
+Henk Poley <HenkPoley@gmail.com>
+Henri Vasserman <henv@hot.ee>
+Henrik Forstén <henrik.forsten@gmail.com>
+Herman Semenov <GermanAizek@yandex.ru>
+Hesen Peng <hesen.peng@gmail.com>
+Hoang Nguyen <hugo53@users.noreply.github.com>
+Hongyu Ouyang <96765450+casavaca@users.noreply.github.com>
+Howard Su <howard0su@gmail.com>
+Hua Jiang <allenhjiang@outlook.com>
+Huawei Lin <huaweilin.cs@gmail.com>
+Ian Bull <irbull@eclipsesource.com>
+Ian Bull <irbull@gmail.com>
+Ian Scrivener <github@zilogy.asia>
+Ido S <ido.pluto@gmail.com>
+IgnacioFDM <ignaciofdm@gmail.com>
+Igor Okulist <okigan@gmail.com>
+Ikko Eltociear Ashimine <eltociear@gmail.com>
+Ilya Kurdyukov <59548320+ilyakurdyukov@users.noreply.github.com>
+Ionoclast Laboratories <brigham@ionoclast.com>
+Isaac McFadyen <isaac@imcf.me>
+IsaacDynamo <61521674+IsaacDynamo@users.noreply.github.com>
+Ivan Komarov <Ivan.Komarov@dfyz.info>
+Ivan Stepanov <ivanstepanovftw@gmail.com>
+JH23X <165871467+JH23X@users.noreply.github.com>
+Jack Mousseau <jmousseau@users.noreply.github.com>
+JackJollimore <130917767+JackJollimore@users.noreply.github.com>
+Jag Chadha <jagtesh@gmail.com>
+Jakub N <jakubniemczyk97@gmail.com>
+James Reynolds <magnusviri@users.noreply.github.com>
+Jan Boon <jan.boon@kaetemi.be>
+Jan Boon <kaetemi@gmail.com>
+Jan Ploski <jpl@plosquare.com>
+Jannis Schönleber <joennlae@gmail.com>
+Jared Van Bortel <cebtenzzre@gmail.com>
+Jared Van Bortel <jared@nomic.ai>
+Jason McCartney <jmac@theroot.org>
+Jean-Christophe Hoelt <hoelt@fovea.cc>
+Jean-Michaël Celerier <jeanmichael.celerier+github@gmail.com>
+Jed Fox <git@jedfox.com>
+Jeffrey Quesnelle <emozilla@nousresearch.com>
+Jesse Jojo Johnson <williamsaintgeorge@gmail.com>
+Jhen-Jie Hong <iainst0409@gmail.com>
+Jiahao Li <liplus17@163.com>
+Jian Liao <jianliao@users.noreply.github.com>
+JidongZhang-THU <1119708529@qq.com>
+Jinwoo Jeong <33892306+williamjeong2@users.noreply.github.com>
+Jiří Podivín <66251151+jpodivin@users.noreply.github.com>
+Johannes Gäßler <johannesg@5d6.de>
+Johannes Rudolph <johannes.rudolph@gmail.com>
+John <78893154+cmp-nct@users.noreply.github.com>
+John Balis <phobossystems@gmail.com>
+John Smith <67539080+kingsidelee@users.noreply.github.com>
+JohnnyB <jboero@users.noreply.github.com>
+Jonas Wunderlich <32615971+jonas-w@users.noreply.github.com>
+Jorge A <161275481+jorgealias@users.noreply.github.com>
+Jose Maldonado <63384398+yukiteruamano@users.noreply.github.com>
+Joseph Stahl <1269177+josephst@users.noreply.github.com>
+Joyce <joycebrum@google.com>
+Juan Calderon-Perez <835733+gaby@users.noreply.github.com>
+Judd <foldl@users.noreply.github.com>
+Julius Arkenberg <arki05@users.noreply.github.com>
+Jun Jie <71215065+junnjiee16@users.noreply.github.com>
+Juraj Bednar <juraj@bednar.io>
+Justin Parker <jparkerweb@gmail.com>
+Justin Suess <justin.suess@westpoint.edu>
+Justine Tunney <jtunney@gmail.com>
+Juuso Alasuutari <juuso.alasuutari@gmail.com>
+KASR <karim.asrih@gmail.com>
+Kamil Tomšík <info@tomsik.cz>
+Karsten Weiss <knweiss@gmail.com>
+Karthick <j.karthic2004@gmail.com>
+Karthik Kumar Viswanathan <195178+guilt@users.noreply.github.com>
+Karthik Sethuraman <k.seth1993@gmail.com>
+Kasumi <90275229+kasumi-1@users.noreply.github.com>
+Kawrakow <48489457+ikawrakow@users.noreply.github.com>
+Keiichi Tabata <keiichi.tabata@outlook.com>
+Kenvix ⭐ <kenvixzure@live.com>
+Kerfuffle <44031344+KerfuffleV2@users.noreply.github.com>
+Kevin Ji <1146876+kevinji@users.noreply.github.com>
+Kevin Kwok <antimatter15@gmail.com>
+Kevin Lo <kevlo@kevlo.org>
+Kolen Cheung <ickc@users.noreply.github.com>
+Konstantin Herud <konstantin.herud@denkbares.com>
+Konstantin Zhuravlyov <konstantin.zhuravlyov@amd.com>
+Kunshang Ji <kunshang.ji@intel.com>
+Kyle Liang <liangmanlai@gmail.com>
+Kyle Mistele <kyle@mistele.com>
+Kylin <56434533+KyL0N@users.noreply.github.com>
+Lars Grammel <lars.grammel@gmail.com>
+Laura <Tijntje_7@msn.com>
+Lee <44310445+lx200916@users.noreply.github.com>
+Lee Drake <b.lee.drake@gmail.com>
+Leng Yue <lengyue@lengyue.me>
+LeonEricsson <70749762+LeonEricsson@users.noreply.github.com>
+Leonardo Neumann <leonardo@neumann.dev.br>
+Li Tan <tanliboy@gmail.com>
+Linwei Wang <wanix1988@gmail.com>
+LoganDark <github@logandark.mozmail.com>
+LostRuins <39025047+LostRuins@users.noreply.github.com>
+Luciano <lucianostrika44@gmail.com>
+Luo Tian <lt@basecity.com>
+M. Yusuf Sarıgöz <yusufsarigoz@gmail.com>
+Maarten ter Huurne <maarten@treewalker.org>
+Mack Straight <eiz@users.noreply.github.com>
+Maël Kerbiriou <m431.kerbiriou@gmail.com>
+MaggotHATE <clay1326@gmail.com>
+Marc Köhlbrugge <subscriptions@marckohlbrugge.com>
+Marco Matthies <71844+marcom@users.noreply.github.com>
+Marcus Dunn <51931484+MarcusDunn@users.noreply.github.com>
+Marian Cepok <marian.cepok@gmail.com>
+Mark Fairbairn <thebaron88@gmail.com>
+Marko Tasic <mtasic85@gmail.com>
+Martin Krasser <krasserm@googlemail.com>
+Martin Schwaighofer <mschwaig@users.noreply.github.com>
+Marvin Gießing <marvin.giessing@gmail.com>
+Mateusz Charytoniuk <mateusz.charytoniuk@protonmail.com>
+Matheus C. França <matheus-catarino@hotmail.com>
+Matheus Gabriel Alves Silva <matheusgasource@gmail.com>
+Mathieu Nayrolles <MathieuNls@users.noreply.github.com>
+Mathijs de Bruin <mathijs@mathijsfietst.nl>
+Matt Clayton <156335168+mattjcly@users.noreply.github.com>
+Matt Pulver <matt.pulver@heavy.ai>
+Matteo Boschini <12133566+mbosc@users.noreply.github.com>
+Matthew Tejo <matthew.tejo@gmail.com>
+Matvey Soloviev <blackhole89@gmail.com>
+Maxime <672982+maximegmd@users.noreply.github.com>
+Maximilian Winter <maximilian.winter.91@gmail.com>
+Meng Zhang <meng@tabbyml.com>
+Meng, Hengyu <hengyu.meng@intel.com>
+Merrick Christensen <merrick.christensen@gmail.com>
+Michael Coppola <m18coppola@gmail.com>
+Michael Hueschen <m@mhueschen.dev>
+Michael Kesper <mkesper@schokokeks.org>
+Michael Klimenko <mklimenko29@gmail.com>
+Michael Podvitskiy <podvitskiymichael@gmail.com>
+Michael Potter <NanoTekGuy@Gmail.com>
+Michaël de Vries <vriesdemichael@gmail.com>
+Mihai <mihai.chirculescu@yahoo.com>
+Mike <ytianhui2004@gmail.com>
+Minsoo Cheong <54794500+mscheong01@users.noreply.github.com>
+Mirko185 <mirkosig@gmail.com>
+Mirror Azure <54669636+MirrorAzure@users.noreply.github.com>
+Miwa / Ensan <63481257+ensan-hcl@users.noreply.github.com>
+Mohammadreza Hendiani <hendiani.mohammadreza@gmail.com>
+Murilo Santana <mvrilo@gmail.com>
+Musab Gultekin <musabgultekin@users.noreply.github.com>
+Nam D. Tran <42194884+namtranase@users.noreply.github.com>
+NawafAlansari <72708095+NawafAlansari@users.noreply.github.com>
+Nebula <infinitewormhole@gmail.com>
+Neo Zhang Jianyu <jianyu.zhang@intel.com>
+Neuman Vong <neuman.vong@gmail.com>
+Nexesenex <124105151+Nexesenex@users.noreply.github.com>
+Niall Coates <1349685+Niall-@users.noreply.github.com>
+Nicolai Weitkemper <kontakt@nicolaiweitkemper.de>
+Nigel Bosch <pnigelb@gmail.com>
+Niklas Korz <niklas@niklaskorz.de>
+Nindaleth <Nindaleth@users.noreply.github.com>
+Oleksandr Nikitin <oleksandr@tvori.info>
+Oleksii Maryshchenko <oleksii.maryshchenko@gmail.com>
+Olivier Chafik <ochafik@users.noreply.github.com>
+Ondřej Čertík <ondrej@certik.us>
+Ouadie EL FAROUKI <ouadie.elfarouki@codeplay.com>
+Paul Tsochantaris <ptsochantaris@icloud.com>
+Pavol Rusnak <pavol@rusnak.io>
+Pedro Cuenca <pedro@huggingface.co>
+Peter Sugihara <peter@campsh.com>
+Phil H <5756783+phiharri@users.noreply.github.com>
+Philip Taron <philip.taron@gmail.com>
+Phillip Kravtsov <phillip@kravtsov.net>
+Pierre Alexandre SCHEMBRI <pa.schembri@gmail.com>
+Pierrick Hymbert <pierrick.hymbert@gmail.com>
+Przemysław Pawełczyk <przemoc@gmail.com>
+Qin Yue Chen <71813199+chenqiny@users.noreply.github.com>
+Qingyou Meng <meng.qingyou@gmail.com>
+Qu Zongfu <43257352+yancaoweidaode@users.noreply.github.com>
+RJ Adriaansen <adriaansen@eshcc.eur.nl>
+Radoslav Gerganov <rgerganov@gmail.com>
+Radosław Gryta <radek.gryta@gmail.com>
+Rahul Vivek Nair <68507071+RahulVivekNair@users.noreply.github.com>
+Rand Xie <randxiexyy29@gmail.com>
+Randall Fitzgerald <randall@dasaku.net>
+Reinforce-II <fate@eastal.com>
+Riceball LEE <snowyu.lee@gmail.com>
+Richard Kiss <him@richardkiss.com>
+Richard Roberson <richardr1126@gmail.com>
+Rick G <26732651+TheFlipbook@users.noreply.github.com>
+Rickard Edén <rickardeden@gmail.com>
+Rickard Hallerbäck <rickard.hallerback@gmail.com>
+Rickey Bowers Jr <bitRAKE@gmail.com>
+Riley Stewart <ristew@users.noreply.github.com>
+Rinne <AsakusaRinne@gmail.com>
+Rinne <liu_yaohui1998@126.com>
+Robert Brisita <986796+rbrisita@users.noreply.github.com>
+Robert Sung-wook Shin <edp1096@users.noreply.github.com>
+Robey Holderith <robey@flaminglunchbox.net>
+Robyn <robyngraf@users.noreply.github.com>
+Roger Meier <r.meier@siemens.com>
+Roland <14355895+rbur0425@users.noreply.github.com>
+Romain D <90720+Artefact2@users.noreply.github.com>
+Romain Neutron <romain@neutron.io>
+Roman Parykin <donderom@gmail.com>
+Ron Evans <ron@hybridgroup.com>
+Ron Jailall <rojailal@gmail.com>
+Ronny Brendel <ronnybrendel@gmail.com>
+Ronsor <ronsor@ronsor.pw>
+Rowan Hart <rowanbhart@gmail.com>
+Rune <43761327+Rune-AI@users.noreply.github.com>
+Ryan Landay <rlanday@gmail.com>
+Ryder Wishart <ryderwishart@gmail.com>
+Rőczey Barnabás <31726601+An0nie@users.noreply.github.com>
+SakuraUmi <yukinon244@gmail.com>
+Salvador E. Tropea <stropea@inti.gob.ar>
+Sam Spilsbury <smspillaz@gmail.com>
+Sami Farin <3876865+Safari77@users.noreply.github.com>
+Samuel Maynard <samwmaynard@gmail.com>
+Sang-Kil Park <sang.park@42dot.ai>
+Seb C <47074056+Sebby37@users.noreply.github.com>
+Sebastián A <sebastian.aedo29@gmail.com>
+SebastianApel <13675545+SebastianApel@users.noreply.github.com>
+Senemu <10880819+Senemu@users.noreply.github.com>
+Sergey Alirzaev <zl29ah@gmail.com>
+Sergio López <slp@sinrega.org>
+SeungWon Jeong <65549245+redlion0929@users.noreply.github.com>
+ShadovvBeast <ShadovvBeast@gmail.com>
+Shakhar Dasgupta <shakhardasgupta@gmail.com>
+Shangning Xu <32517059+xushangning@users.noreply.github.com>
+Shijie <821898965@qq.com>
+Shintarou Okada <kokuzen@gmail.com>
+Shouzheng Liu <61452103+lshzh-ww@users.noreply.github.com>
+Shouzheng Liu <lshzh.hi@gmail.com>
+Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
+Simon Willison <swillison@gmail.com>
+Siwen Yu <yusiwen@gmail.com>
+Sky Yan <skyan83@gmail.com>
+Slaren <2141330+slaren@users.noreply.github.com>
+Slava Primenko <primenko.s@gmail.com>
+SoftwareRenderer <138734813+SoftwareRenderer@users.noreply.github.com>
+Someone <sergei.kozlukov@aalto.fi>
+Someone Serge <sergei.kozlukov@aalto.fi>
+Sourab Mangrulkar <13534540+pacman100@users.noreply.github.com>
+Spencer Sutton <spencersutton@users.noreply.github.com>
+Srinivas Billa <nivibilla@gmail.com>
+Stefan Sydow <stefan@sydow.email>
+Stephan Walter <stephan@walter.name>
+Stephen Nichols <snichols@users.noreply.github.com>
+Steve Grubb <ausearch.1@gmail.com>
+Steven Roussey <sroussey@gmail.com>
+Steward Garcia <57494570+FSSRepo@users.noreply.github.com>
+Suaj Carrot <72162667+SuajCarrot@users.noreply.github.com>
+SuperUserNameMan <yoann@terminajones.com>
+Tai Duc Nguyen <taiducnguyen.drexel@gmail.com>
+Taikono-Himazin <kazu@po.harenet.ne.jp>
+Tameem <113388789+AhmadTameem@users.noreply.github.com>
+Tamotsu Takahashi <ttakah+github@gmail.com>
+Thái Hoàng Tâm <75922889+RoyalHeart@users.noreply.github.com>
+Thatcher Chamberlin <j.thatcher.c@gmail.com>
+Theia Vogel <theia@vgel.me>
+Thérence <13496987+Royalphax@users.noreply.github.com>
+Thibault Terrasson <thibault.terrasson@gmail.com>
+Thomas Klausner <wiz@gatalith.at>
+Tim Miller <drasticactions@users.noreply.github.com>
+Timmy Knight <r2d2fish@gmail.com>
+Timothy Cronin <40186632+4imothy@users.noreply.github.com>
+Ting Lou <ting.lou@gmail.com>
+Ting Sun <suntcrick@gmail.com>
+Tobias Lütke <tobi@shopify.com>
+Tom C <tom.corelis@gmail.com>
+Tom Jobbins <784313+TheBloke@users.noreply.github.com>
+Tomas <tom.tomas.36478119@gmail.com>
+Tomáš Pazdiora <tomas.pazdiora@gmail.com>
+Tristan Ross <rosscomputerguy@protonmail.com>
+Tungsten842 <886724vf@anonaddy.me>
+Tungsten842 <quantmint@protonmail.com>
+Tushar <ditsuke@protonmail.com>
+UEXTM.com <84163508+uextm@users.noreply.github.com>
+Uzo Nweke <uzoechi@gmail.com>
+Vaibhav Srivastav <vaibhavs10@gmail.com>
+Val Kharitonov <mail@kharvd.com>
+Valentin Konovalov <valle.ketsujin@gmail.com>
+Valentyn Bezshapkin <61702053+valentynbez@users.noreply.github.com>
+Victor Z. Peng <ziliangdotme@gmail.com>
+Vlad <spitfireage@gmail.com>
+Vladimir <bogdad@gmail.com>
+Vladimir Malyutin <first-leon@yandex.ru>
+Vladimir Zorin <vladimir@deviant.guru>
+Volodymyr Vitvitskyi <72226+signalpillar@users.noreply.github.com>
+WangHaoranRobin <56047610+WangHaoranRobin@users.noreply.github.com>
+Weird Constructor <weirdconstructor@gmail.com>
+Welby Seely <welbyseely@gmail.com>
+Wentai Zhang <rchardx@gmail.com>
+WillCorticesAI <150854901+WillCorticesAI@users.noreply.github.com>
+Willy Tarreau <w@1wt.eu>
+Wu Jian Ping <wujjpp@hotmail.com>
+Wu Jian Ping <wujp@greatld.com>
+Xiake Sun <xiake.sun@intel.com>
+Xiang (Kevin) Li <kevinli020508@gmail.com>
+Xiao-Yong Jin <jinxiaoyong@gmail.com>
+XiaotaoChen <chenxiaotao1234@gmail.com>
+Xiaoyi Chen <cxychina@gmail.com>
+Xingchen Song(宋星辰) <xingchensong1996@163.com>
+Xuan Son Nguyen <thichthat@gmail.com>
+Yann Follet <131855179+YannFollet@users.noreply.github.com>
+Yiming Cui <conandiy@vip.qq.com>
+Yishuo Wang <MeouSker77@outlook.com>
+Yueh-Po Peng <94939112+y10ab1@users.noreply.github.com>
+Yui <dev@sleepyyui.com>
+Yusuf Kağan Hanoğlu <hanoglu@yahoo.com>
+Yuval Peled <31162840+Yuval-Peled@users.noreply.github.com>
+ZHAOKAI WANG <sanxianwei@163.com>
+Zane Shannon <z@zcs.me>
+Zay <95888118+isaiahbjork@users.noreply.github.com>
+Zenix <zenixls2@gmail.com>
+Zhang Peiyuan <a1286225768@gmail.com>
+ZhouYuChen <zhouyuchen@naver.com>
+Ziad Ben Hadj-Alouane <zied.benhadjalouane@gmail.com>
+Ziang Wu <97337387+ZiangWu-77@users.noreply.github.com>
+Zsapi <martin1.zsapka@gmail.com>
+a-n-n-a-l-e-e <150648636+a-n-n-a-l-e-e@users.noreply.github.com>
+adel boussaken <netdur@gmail.com>
+afrideva <95653597+afrideva@users.noreply.github.com>
+akawrykow <142945436+akawrykow@users.noreply.github.com>
+alexpinel <93524949+alexpinel@users.noreply.github.com>
+alonfaraj <alonfaraj@gmail.com>
+andrijdavid <david@geek.mg>
+anon998 <131767832+anon998@users.noreply.github.com>
+anzz1 <anzz1@live.com>
+apaz <aarpazdera@gmail.com>
+apcameron <37645737+apcameron@users.noreply.github.com>
+arcrank <arcrank@gmail.com>
+arlo-phoenix <140345165+arlo-phoenix@users.noreply.github.com>
+at8u <129688334+at8u@users.noreply.github.com>
+automaticcat <daogiatuank54@gmail.com>
+bandoti <141645996+bandoti@users.noreply.github.com>
+beiller <beiller@gmail.com>
+bhubbb <79117352+bhubbb@users.noreply.github.com>
+bmwl <brian.marshall@tolko.com>
+bobqianic <129547291+bobqianic@users.noreply.github.com>
+bryanSwk <93190252+bryanSwk@users.noreply.github.com>
+bsilvereagle <bsilvereagle@users.noreply.github.com>
+bssrdf <merlintiger@hotmail.com>
+byte-6174 <88070277+byte-6174@users.noreply.github.com>
+cebtenzzre <cebtenzzre@gmail.com>
+chaihahaha <chai836275709@gmail.com>
+chiranko <96988916+chiranko@users.noreply.github.com>
+clibdev <52199778+clibdev@users.noreply.github.com>
+clyang <clyang@clyang.net>
+cocktailpeanut <121128867+cocktailpeanut@users.noreply.github.com>
+coezbek <c.oezbek@gmail.com>
+comex <comexk@gmail.com>
+compilade <113953597+compilade@users.noreply.github.com>
+crasm <crasm@git.vczf.net>
+crasm <crasm@git.vczf.us>
+daboe01 <daboe01@googlemail.com>
+david raistrick <keen99@users.noreply.github.com>
+ddpasa <112642920+ddpasa@users.noreply.github.com>
+deepdiffuser <112834445+deepdiffuser@users.noreply.github.com>
+divinity76 <divinity76@gmail.com>
+dotpy314 <33351922+dotpy314@users.noreply.github.com>
+drbh <david.richard.holtz@gmail.com>
+ds5t5 <145942675+ds5t5@users.noreply.github.com>
+dylan <canardleteer@users.noreply.github.com>
+eastriver <lee@eastriver.dev>
+ebraminio <ebraminio@gmail.com>
+eiery <19350831+eiery@users.noreply.github.com>
+eric8607242 <e0928021388@gmail.com>
+fraxy-v <65565042+fraxy-v@users.noreply.github.com>
+github-actions[bot] <github-actions[bot]@users.noreply.github.com>
+gliptic <gliptic@users.noreply.github.com>
+goerch <jhr.walter@t-online.de>
+grahameth <96447521+grahameth@users.noreply.github.com>
+gwjr <502526+gwjr@users.noreply.github.com>
+h-h-h-h <13482553+h-h-h-h@users.noreply.github.com>
+hankcs <cnhankmc@gmail.com>
+hoangmit <hoangmit@users.noreply.github.com>
+hongbo.mo <352280764@qq.com>
+howlger <eclipse@voormann.de>
+howlger <github@voormann.de>
+hutli <6594598+hutli@users.noreply.github.com>
+hutli <hutli@hutli.hu>
+hutli <jensstaermose@hotmail.com>
+hxer7963 <hxer7963@gmail.com>
+hydai <z54981220@gmail.com>
+iSma <ismail.senhaji@gmail.com>
+iacore <74560659+iacore@users.noreply.github.com>
+igarnier <igarnier@protonmail.com>
+iohub <rickyang.pro@gmail.com>
+jacobi petrucciani <8117202+jpetrucciani@users.noreply.github.com>
+jameswu2014 <545426914@qq.com>
+jneem <joeneeman@gmail.com>
+johnson442 <56517414+johnson442@users.noreply.github.com>
+jon-chuang <9093549+jon-chuang@users.noreply.github.com>
+jp-x-g <jpxg-dev@protonmail.com>
+jwj7140 <32943891+jwj7140@users.noreply.github.com>
+kaizau <kaizau@users.noreply.github.com>
+kalomaze <66376113+kalomaze@users.noreply.github.com>
+kang <tpdns9032100@gmail.com>
+katsu560 <118887472+katsu560@users.noreply.github.com>
+kchro3 <62481661+kchro3@users.noreply.github.com>
+khimaros <me@khimaros.com>
+kiltyj <kiltyj@gmail.com>
+klosax <131523366+klosax@users.noreply.github.com>
+kunal-vaishnavi <115581922+kunal-vaishnavi@users.noreply.github.com>
+kunnis <kunnis@users.noreply.github.com>
+kuronekosaiko <EvanChanJ@163.com>
+kuvaus <22169537+kuvaus@users.noreply.github.com>
+kwin1412 <42286931+kwin1412@users.noreply.github.com>
+l3utterfly <gc.pthzfoldr@gmail.com>
+ldwang <ftgreat@163.com>
+le.chang <cljs118@126.com>
+leejet <leejet714@gmail.com>
+limitedAtonement <limitedAtonement@users.noreply.github.com>
+lon <114724657+longregen@users.noreply.github.com>
+m3ndax <adrian.goessl@outlook.com>
+maddes8cht <55592906+maddes8cht@users.noreply.github.com>
+makomk <makosoft@googlemail.com>
+manikbhandari <mbbhandarimanik2@gmail.com>
+mdrokz <mohammadmunshi@gmail.com>
+mgroeber9110 <45620825+mgroeber9110@users.noreply.github.com>
+minarchist <minarchist@users.noreply.github.com>
+mj-shifu <77107165+mj-shifu@users.noreply.github.com>
+mmyjona <jonathan.gonse@gmail.com>
+momonga <115213907+mmnga@users.noreply.github.com>
+moritzbrantner <31051084+moritzbrantner@users.noreply.github.com>
+mzcu <milos.cubrilo@gmail.com>
+nanahi <130121847+na-na-hi@users.noreply.github.com>
+ngc92 <7938269+ngc92@users.noreply.github.com>
+nhamanasu <45545786+nhamanasu@users.noreply.github.com>
+niansa/tuxifan <anton-sa@web.de>
+niansa/tuxifan <tuxifan@posteo.de>
+ningshanwutuobang <ningshanwutuobang@gmail.com>
+nold <Nold360@users.noreply.github.com>
+nopperl <54780682+nopperl@users.noreply.github.com>
+nusu-github <29514220+nusu-github@users.noreply.github.com>
+olexiyb <olexiyb@gmail.com>
+oobabooga <112222186+oobabooga@users.noreply.github.com>
+opparco <parco.opaai@gmail.com>
+ostix360 <55257054+ostix360@users.noreply.github.com>
+perserk <perserk@gmail.com>
+postmasters <namnguyen@google.com>
+pudepiedj <pudepiedj@gmail.com>
+qingfengfenga <41416092+qingfengfenga@users.noreply.github.com>
+qouoq <qouoq@fastmail.com>
+qunash <anzoria@gmail.com>
+rabidcopy <rabidcopy@yahoo.com>
+rankaiyx <rankaiyx@rankaiyx.com>
+rhjdvsgsgks <26178113+rhjdvsgsgks@users.noreply.github.com>
+rhuddleston <ryan.huddleston@percona.com>
+rimoliga <53384203+rimoliga@users.noreply.github.com>
+runfuture <runfuture@users.noreply.github.com>
+sandyiscool <sandyiscool@gmail.com>
+semidark <me@semidark.net>
+sharpHL <132747147+sharpHL@users.noreply.github.com>
+shibe2 <shibe@tuta.io>
+singularity <12184989+singularity-s0@users.noreply.github.com>
+sjinzh <sjinzh@gmail.com>
+slaren <2141330+slaren@users.noreply.github.com>
+slaren <slarengh@gmail.com>
+snadampal <87143774+snadampal@users.noreply.github.com>
+staviq <staviq@gmail.com>
+stduhpf <stephduh@live.fr>
+swittk <switt1995@gmail.com>
+takov751 <40316768+takov751@users.noreply.github.com>
+tarcey <cey.tarik@gmail.com>
+texmex76 <40733439+texmex76@users.noreply.github.com>
+thement <40525767+thement@users.noreply.github.com>
+tjohnman <tjohnman@users.noreply.github.com>
+tslmy <tslmy@users.noreply.github.com>
+ubik2 <ubik2@users.noreply.github.com>
+uint256_t <konndennsa@gmail.com>
+uint256_t <maekawatoshiki1017@gmail.com>
+unbounded <haakon@likedan.net>
+valiray <133289098+valiray@users.noreply.github.com>
+vodkaslime <646329483@qq.com>
+vvhg1 <94630311+vvhg1@users.noreply.github.com>
+vxiiduu <73044267+vxiiduu@users.noreply.github.com>
+wbpxre150 <100937007+wbpxre150@users.noreply.github.com>
+whoreson <139810751+whoreson@users.noreply.github.com>
+wonjun Jang <strutive07@gmail.com>
+wzy <32936898+Freed-Wu@users.noreply.github.com>
+xaedes <xaedes@gmail.com>
+xaedes <xaedes@googlemail.com>
+xloem <0xloem@gmail.com>
+yangli2 <yangli2@gmail.com>
+yuiseki <yuiseki@gmail.com>
+zakkor <edward.partenie@gmail.com>
+zhouwg <6889919+zhouwg@users.noreply.github.com>
+zrm <trustiosity.zrm@gmail.com>
+源文雨 <41315874+fumiama@users.noreply.github.com>
+Нияз Гарифзянов <112617865+garrnizon@users.noreply.github.com>
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -113,6 +113,9 @@ option(LLAMA_METAL                           "llama: use Metal"
 option(LLAMA_METAL_NDEBUG                    "llama: disable Metal debugging"                   OFF)
 option(LLAMA_METAL_SHADER_DEBUG              "llama: compile Metal with -fno-fast-math"         OFF)
 option(LLAMA_METAL_EMBED_LIBRARY             "llama: embed Metal library"                       OFF)
+set(LLAMA_METAL_MACOSX_VERSION_MIN "" CACHE STRING
+                                             "llama: metal minimum macOS version")
+set(LLAMA_METAL_STD "" CACHE STRING          "llama: metal standard version (-std flag)")
 option(LLAMA_KOMPUTE                         "llama: use Kompute"                               OFF)
 option(LLAMA_MPI                             "llama: use MPI"                                   OFF)
 option(LLAMA_QKK_64                          "llama: use super-block size of 64 for k-quants"   OFF)
@@ -250,6 +253,16 @@ if (LLAMA_METAL)
            set(XC_FLAGS -O3)
        endif()

+        # Append macOS metal versioning flags
+        if (LLAMA_METAL_MACOSX_VERSION_MIN)
+            message(STATUS "Adding -mmacosx-version-min=${LLAMA_METAL_MACOSX_VERSION_MIN} flag to metal compilation")
+            list(APPEND XC_FLAGS -mmacosx-version-min=${LLAMA_METAL_MACOSX_VERSION_MIN})
+        endif()
+        if (LLAMA_METAL_STD)
+            message(STATUS "Adding -std=${LLAMA_METAL_STD} flag to metal compilation")
+            list(APPEND XC_FLAGS -std=${LLAMA_METAL_STD})
+        endif()
+
        add_custom_command(
            OUTPUT ${CMAKE_RUNTIME_OUTPUT_DIRECTORY}/default.metallib
            COMMAND xcrun -sdk macosx metal    ${XC_FLAGS} -c ${CMAKE_RUNTIME_OUTPUT_DIRECTORY}/ggml-metal.metal -o ${CMAKE_RUNTIME_OUTPUT_DIRECTORY}/ggml-metal.air
--- a/2
+++ b/2
@@ -1,6 +1,6 @@
 MIT License

-Copyright (c) 2023 Georgi Gerganov
+Copyright (c) 2023-2024 The ggml authors

 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal
--- a/README-sycl.md
+++ b/README-sycl.md
@@ -3,7 +3,7 @@
 - [Background](#background)
 - [News](#news)
 - [OS](#os)
- [Intel GPU](#intel-gpu)
+- [Supported Devices](#supported-devices)
 - [Docker](#docker)
 - [Linux](#linux)
 - [Windows](#windows)
@@ -14,17 +14,25 @@

 ## Background

-SYCL is a higher-level programming model to improve programming productivity on various hardware accelerators—such as CPUs, GPUs, and FPGAs. It is a single-source embedded domain-specific language based on pure C++17.
+**SYCL** is a high-level parallel programming model designed to improve developers productivity writing code across various hardware accelerators such as CPUs, GPUs, and FPGAs. It is a single-source language designed for heterogeneous computing and based on standard C++17.

-oneAPI is a specification that is open and standards-based, supporting multiple architecture types including but not limited to GPU, CPU, and FPGA. The spec has both direct programming and API-based programming paradigms.
+**oneAPI** is an open ecosystem and a standard-based specification, supporting multiple architectures including but not limited to intel CPUs, GPUs and FPGAs. The key components of the oneAPI ecosystem include:

-Intel uses the SYCL as direct programming language to support CPU, GPUs and FPGAs.
+- **DPCPP** *(Data Parallel C++)*: The primary oneAPI SYCL implementation, which includes the icpx/icx Compilers.
+- **oneAPI Libraries**: A set of highly optimized libraries targeting multiple domains *(e.g. oneMKL - Math Kernel Library)*.
+- **oneAPI LevelZero**: A high performance low level interface for fine-grained control over intel iGPUs and dGPUs.
+- **Nvidia & AMD Plugins**: These are plugins extending oneAPI's DPCPP support to SYCL on Nvidia and AMD GPU targets.

-To avoid to re-invent the wheel, this code refer other code paths in llama.cpp (like OpenBLAS, cuBLAS, CLBlast). We use a open-source tool [SYCLomatic](https://github.com/oneapi-src/SYCLomatic) (Commercial release [Intel® DPC++ Compatibility Tool](https://www.intel.com/content/www/us/en/developer/tools/oneapi/dpc-compatibility-tool.html)) migrate to SYCL.
+### Llama.cpp + SYCL
+This SYCL "backend" follows the same design found in other llama.cpp BLAS-based paths such as *OpenBLAS, cuBLAS, CLBlast etc..*. The oneAPI's [SYCLomatic](https://github.com/oneapi-src/SYCLomatic) open-source migration tool (Commercial release [Intel® DPC++ Compatibility Tool](https://www.intel.com/content/www/us/en/developer/tools/oneapi/dpc-compatibility-tool.html)) was used for this purpose.

-The llama.cpp for SYCL is used to support Intel GPUs.
+The llama.cpp SYCL backend supports:
+- Intel GPUs.
+- Nvidia GPUs.

-For Intel CPU, recommend to use llama.cpp for X86 (Intel MKL building).
+*Upcoming support: AMD GPUs*.
+
+When targetting **Intel CPUs**, it is recommended to  use llama.cpp for [x86_64](README.md#intel-onemkl) approach.

 ## News

@@ -51,9 +59,16 @@ For Intel CPU, recommend to use llama.cpp for X86 (Intel MKL building).
 |Windows|Support|Windows 11|


-## Intel GPU
+## Supported devices

-### Verified
+### Intel GPUs
+
+The oneAPI Math Kernel Library, which the oneAPI base-toolkit includes, supports intel GPUs. In order to make it "visible", simply run the following:
+```sh
+source /opt/intel/oneapi/setvars.sh
+```
+
+- **Tested devices**

 |Intel GPU| Status | Verified Model|
 |-|-|-|
@@ -63,198 +78,229 @@ For Intel CPU, recommend to use llama.cpp for X86 (Intel MKL building).
 |Intel built-in Arc GPU| Support| built-in Arc GPU in Meteor Lake|
 |Intel iGPU| Support| iGPU in i5-1250P, i7-1260P, i7-1165G7|

-Note: If the EUs (Execution Unit) in iGPU is less than 80, the inference speed will be too slow to use.
+*Notes:*

-### Memory
+- Device memory can be a limitation when running a large model on an intel GPU. The loaded model size, *`llm_load_tensors: buffer_size`*, is displayed in the log when running `./bin/main`.

-The memory is a limitation to run LLM on GPUs.
+- Please make sure the GPU shared memory from the host is large enough to account for the model's size. For e.g. the *llama-2-7b.Q4_0* requires at least 8.0GB for integrated GPUs and 4.0GB for discrete GPUs.

-When run llama.cpp, there is print log to show the applied memory on GPU. You could know how much memory to be used in your case. Like `llm_load_tensors:            buffer size =  3577.56 MiB`.
+- If the iGPU has less than 80  EUs *(Execution Unit)*, the inference speed will likely be too slow for practical use.

-For iGPU, please make sure the shared memory from host memory is enough. For llama-2-7b.Q4_0, recommend the host memory is 8GB+.
+### Nvidia GPUs
+The BLAS acceleration on Nvidia GPUs through oneAPI can be obtained using the Nvidia plugins for oneAPI and the cuBLAS backend of the upstream oneMKL library. Details and instructions on how to setup the runtime and library can be found in [this section](#i-setup-environment)

-For dGPU, please make sure the device memory is enough. For llama-2-7b.Q4_0, recommend the device memory is 4GB+.
+- **Tested devices**

-## Nvidia GPU
-
-### Verified
-
-|Intel GPU| Status | Verified Model|
+|Nvidia GPU| Status | Verified Model|
 |-|-|-|
-|Ampere Series| Support| A100|
+|Ampere Series| Support| A100, A4000|
+|Ampere Series *(Mobile)*| Support| RTX 40 Series|

-### oneMKL for CUDA
+*Notes:*
+  - Support for Nvidia targets through oneAPI is currently limited to Linux platforms.

-The current oneMKL release does not contain the oneMKL cuBlas backend.
-As a result for Nvidia GPU's oneMKL must be built from source.
+  - Please make sure the native oneAPI MKL *(dedicated to intel CPUs and GPUs)* is not "visible" at this stage to properly setup and use the built-from-source oneMKL with cuBLAS backend in llama.cpp for Nvidia GPUs.

-```
-git clone https://github.com/oneapi-src/oneMKL
-cd oneMKL
-mkdir build
-cd build
-cmake -G Ninja .. -DCMAKE_CXX_COMPILER=icpx -DCMAKE_C_COMPILER=icx -DENABLE_MKLGPU_BACKEND=OFF -DENABLE_MKLCPU_BACKEND=OFF -DENABLE_CUBLAS_BACKEND=ON
-ninja
-// Add paths as necessary
-```

 ## Docker
-
-Note:
- Only docker on Linux is tested. Docker on WSL may not work.
- You may need to install Intel GPU driver on the host machine (See the [Linux](#linux) section to know how to do that)
-
-### Build the image
-
-You can choose between **F16** and **F32** build. F16 is faster for long-prompt inference.
-
-
+The docker build option is currently limited to *intel GPU* targets.
+### Build image
 ```sh
-# For F16:
-#docker build -t llama-cpp-sycl --build-arg="LLAMA_SYCL_F16=ON" -f .devops/main-intel.Dockerfile .
-
-# Or, for F32:
-docker build -t llama-cpp-sycl -f .devops/main-intel.Dockerfile .
-
-# Note: you can also use the ".devops/server-intel.Dockerfile", which compiles the "server" example
+# Using FP16
+docker build -t llama-cpp-sycl --build-arg="LLAMA_SYCL_F16=ON" -f .devops/main-intel.Dockerfile .
 ```

-### Run
+*Notes*:
+
+To build in default FP32 *(Slower than FP16 alternative)*, you can remove the `--build-arg="LLAMA_SYCL_F16=ON"` argument from the previous command.
+
+You can also use the `.devops/server-intel.Dockerfile`, which builds the *"server"* alternative.
+
+### Run container

 ```sh
-# Firstly, find all the DRI cards:
+# First, find all the DRI cards
 ls -la /dev/dri
-# Then, pick the card that you want to use.
-
-# For example with "/dev/dri/card1"
+# Then, pick the card that you want to use (here for e.g. /dev/dri/card1).
 docker run -it --rm -v "$(pwd):/app:Z" --device /dev/dri/renderD128:/dev/dri/renderD128 --device /dev/dri/card1:/dev/dri/card1 llama-cpp-sycl -m "/app/models/YOUR_MODEL_FILE" -p "Building a website can be done in 10 simple steps:" -n 400 -e -ngl 33
 ```

+*Notes:*
+- Docker has been tested successfully on native Linux. WSL support has not been verified yet.
+- You may need to install Intel GPU driver on the **host** machine *(Please refer to the [Linux configuration](#linux) for details)*.
+
 ## Linux

-### Setup Environment
+### I. Setup Environment

-1. Install Intel GPU driver.
+1. **Install GPU drivers**

-a. Please install Intel GPU driver by official guide: [Install GPU Drivers](https://dgpu-docs.intel.com/driver/installation.html).
+  - **Intel GPU**

-Note: for iGPU, please install the client GPU driver.
+Intel data center GPUs drivers installation guide and download page can be found here: [Get intel dGPU Drivers](https://dgpu-docs.intel.com/driver/installation.html#ubuntu-install-steps).

-b. Add user to group: video, render.
+*Note*: for client GPUs *(iGPU & Arc A-Series)*, please refer to the [client iGPU driver installation](https://dgpu-docs.intel.com/driver/client/overview.html).
+
+Once installed, add the user(s) to the `video` and `render` groups.

 ```sh
-sudo usermod -aG render username
-sudo usermod -aG video username
+sudo usermod -aG render $USER
+sudo usermod -aG video $USER
 ```

-Note: re-login to enable it.
+*Note*: logout/re-login for the changes to take effect.

-c. Check
+Verify installation through `clinfo`:

 ```sh
 sudo apt install clinfo
 sudo clinfo -l
 ```

-Output (example):
+Sample output:

-```
+```sh
 Platform #0: Intel(R) OpenCL Graphics
 `-- Device #0: Intel(R) Arc(TM) A770 Graphics

-
 Platform #0: Intel(R) OpenCL HD Graphics
 `-- Device #0: Intel(R) Iris(R) Xe Graphics [0x9a49]
 ```

-2. Install Intel® oneAPI Base toolkit.
+- **Nvidia GPU**

-a. Please follow the procedure in [Get the Intel® oneAPI Base Toolkit ](https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit.html).
+In order to target Nvidia GPUs through SYCL, please make sure the CUDA/CUBLAS native requirements *-found [here](README.md#cublas)-* are installed.
+Installation can be verified by running the following:
+```sh
+nvidia-smi
+```
+Please make sure at least one CUDA device is available, which can be displayed like this *(here an A100-40GB Nvidia GPU)*:
+```
+---------------------------------------------------------------------------------------+
+| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
+|-----------------------------------------+----------------------+----------------------+
+| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
+| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
+|                                         |                      |               MIG M. |
+|=========================================+======================+======================|
+|   0  NVIDIA A100-PCIE-40GB          On  | 00000000:8D:00.0 Off |                    0 |
+| N/A   36C    P0              57W / 250W |      4MiB / 40960MiB |      0%      Default |
+|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
+```

-Recommend to install to default folder: **/opt/intel/oneapi**.

-Following guide use the default folder as example. If you use other folder, please modify the following guide info with your folder.
+2. **Install Intel® oneAPI Base toolkit**

-b. Check
+- **Base installation**
+
+The base toolkit can be obtained from the official [Intel® oneAPI Base Toolkit](https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit.html) page.
+
+Please follow the instructions for downloading and installing the Toolkit for Linux, and preferably keep the default installation values unchanged, notably the installation path *(`/opt/intel/oneapi` by default)*.
+
+Following guidelines/code snippets assume the default installation values. Otherwise, please make sure the necessary changes are reflected where applicable.
+
+Upon a successful installation, SYCL is enabled for the available intel devices, along with relevant libraries such as oneAPI MKL for intel GPUs.
+
+- **Adding support to Nvidia GPUs**
+
+**oneAPI**: In order to enable SYCL support on Nvidia GPUs, please install the [Codeplay oneAPI Plugin for Nvidia GPUs](https://developer.codeplay.com/products/oneapi/nvidia/download). User should also make sure the plugin version matches the installed base toolkit one *(previous step)* for a seamless "oneAPI on Nvidia GPU" setup.
+
+
+**oneMKL**: The current oneMKL releases *(shipped with the oneAPI base-toolkit)* do not contain the cuBLAS backend. A build from source of the upstream [oneMKL](https://github.com/oneapi-src/oneMKL) with the *cuBLAS* backend enabled is thus required to run it on Nvidia GPUs.

 ```sh
-source /opt/intel/oneapi/setvars.sh
+git clone https://github.com/oneapi-src/oneMKL
+cd oneMKL
+mkdir -p buildWithCublas && cd buildWithCublas
+cmake ../ -DCMAKE_CXX_COMPILER=icpx -DCMAKE_C_COMPILER=icx -DENABLE_MKLGPU_BACKEND=OFF -DENABLE_MKLCPU_BACKEND=OFF -DENABLE_CUBLAS_BACKEND=ON -DTARGET_DOMAINS=blas
+make
+```

+
+3. **Verify installation and environment**
+
+In order to check the available SYCL devices on the machine, please use the `sycl-ls` command.
+```sh
+source /opt/intel/oneapi/setvars.sh
 sycl-ls
 ```

-There should be one or more level-zero devices. Please confirm that at least one GPU is present, like **[ext_oneapi_level_zero:gpu:0]**.
+- **Intel GPU**
+
+When targeting an intel GPU, the user should expect one or more level-zero devices among the available SYCL devices. Please make sure that at least one GPU is present, for instance [`ext_oneapi_level_zero:gpu:0`] in the sample output below:

-Output (example):
 ```
 [opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2023.16.10.0.17_160000]
 [opencl:cpu:1] Intel(R) OpenCL, 13th Gen Intel(R) Core(TM) i7-13700K OpenCL 3.0 (Build 0) [2023.16.10.0.17_160000]
 [opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics OpenCL 3.0 NEO  [23.30.26918.50]
 [ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.26918]
-
 ```

-2. Build locally:
+- **Nvidia GPU**

-Note:
- You can choose between **F16** and **F32** build. F16 is faster for long-prompt inference.
- By default, it will build for all binary files. It will take more time. To reduce the time, we recommend to build for **example/main** only.
+Similarly, user targetting Nvidia GPUs should expect at least one SYCL-CUDA device [`ext_oneapi_cuda:gpu`] as bellow:
+```
+[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2023.16.12.0.12_195853.xmain-hotfix]
+[opencl:cpu:1] Intel(R) OpenCL, Intel(R) Xeon(R) Gold 6326 CPU @ 2.90GHz OpenCL 3.0 (Build 0) [2023.16.12.0.12_195853.xmain-hotfix]
+[ext_oneapi_cuda:gpu:0] NVIDIA CUDA BACKEND, NVIDIA A100-PCIE-40GB 8.0 [CUDA 12.2]
+```

+### II. Build llama.cpp
+
+#### Intel GPU
 ```sh
-mkdir -p build
-cd build
+# Export relevant ENV variables
 source /opt/intel/oneapi/setvars.sh

-# For FP16:
-#cmake .. -DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DLLAMA_SYCL_F16=ON
+# Build LLAMA with MKL BLAS acceleration for intel GPU
+mkdir -p build && cd build

-# Or, for FP32:
+# Option 1: Use FP16 for better performance in long-prompt  inference
+cmake .. -DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DLLAMA_SYCL_F16=ON
+
+# Option 2: Use FP32 by default
 cmake .. -DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
-
-# For Nvidia GPUs
-cmake .. -DLLAMA_SYCL=ON -DLLAMA_SYCL_TARGET=NVIDIA -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
-
-# Build example/main only
-#cmake --build . --config Release --target main
-
-# Or, build all binary
-cmake --build . --config Release -v
-
-cd ..
 ```

-or
-
+#### Nvidia GPU
 ```sh
-./examples/sycl/build.sh
+# Export relevant ENV variables
+export LD_LIBRARY_PATH=/path/to/oneMKL/buildWithCublas/lib:$LD_LIBRARY_PATH
+export LIBRARY_PATH=/path/to/oneMKL/buildWithCublas/lib:$LIBRARY_PATH
+export CPLUS_INCLUDE_DIR=/path/to/oneMKL/buildWithCublas/include:$CPLUS_INCLUDE_DIR
+export CPLUS_INCLUDE_DIR=/path/to/oneMKL/include:$CPLUS_INCLUDE_DIR
+
+# Build LLAMA with Nvidia BLAS acceleration through SYCL
+mkdir -p build && cd build
+
+# Option 1: Use FP16 for better performance in long-prompt  inference
+cmake .. -DLLAMA_SYCL=ON -DLLAMA_SYCL_TARGET=NVIDIA -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DLLAMA_SYCL_F16=ON
+
+# Option 2: Use FP32 by default
+cmake .. -DLLAMA_SYCL=ON -DLLAMA_SYCL_TARGET=NVIDIA -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
 ```

-### Run
+### III. Run the inference

-1. Put model file to folder **models**
+1. Retrieve and prepare model

-You could download [llama-2-7b.Q4_0.gguf](https://huggingface.co/TheBloke/Llama-2-7B-GGUF/blob/main/llama-2-7b.Q4_0.gguf) as example.
+You can refer to the general [*Prepare and Quantize*](README#prepare-and-quantize) guide for model prepration, or simply download [llama-2-7b.Q4_0.gguf](https://huggingface.co/TheBloke/Llama-2-7B-GGUF/blob/main/llama-2-7b.Q4_0.gguf) model as example.

 2. Enable oneAPI running environment

-```
+```sh
 source /opt/intel/oneapi/setvars.sh
 ```

-3. List device ID
+3. List devices information

-Run without parameter:
+Similar to the native `sycl-ls`, available SYCL devices can be queried as follow:

 ```sh
 ./build/bin/ls-sycl-device
-
-# or running the "main" executable and look at the output log:
-
-./build/bin/main
 ```
-
-Check the ID in startup log, like:
-
+A example of such log in a system with 1 *intel CPU* and 1 *intel GPU* can look like the following:
 ```
 found 6 SYCL devices:
 |  |                  |                                             |Compute   |Max compute|Max work|Max sub|               |
@@ -270,15 +316,15 @@ found 6 SYCL devices:

 |Attribute|Note|
 |-|-|
-|compute capability 1.3|Level-zero running time, recommended |
-|compute capability 3.0|OpenCL running time, slower than level-zero in most cases|
+|compute capability 1.3|Level-zero driver/runtime, recommended |
+|compute capability 3.0|OpenCL driver/runtime, slower than level-zero in most cases|

-4. Device selection and execution of llama.cpp
+4. Launch inference

 There are two device selection modes:

- Single device: Use one device assigned by user.
- Multiple devices: Automatically choose the devices with the same biggest Max compute units.
+- Single device: Use one device target specified by the user.
+- Multiple devices: Automatically select the devices with the same largest Max compute-units.

 |Device selection|Parameter|
 |-|-|
@@ -303,74 +349,64 @@ or run by script:
 ```sh
 ZES_ENABLE_SYSMAN=1 ./build/bin/main -m models/llama-2-7b.Q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 400 -e -ngl 33 -sm layer
 ```
-or run by script:
+
+Otherwise, you can run the script:

 ```sh
 ./examples/sycl/run_llama2.sh
 ```

-Note:
+*Notes:*

- By default, mmap is used to read model file. In some cases, it leads to the hang issue. Recommend to use parameter **--no-mmap** to disable mmap() to skip this issue.
+- By default, `mmap` is used to read the model file. In some cases, it causes runtime hang issues. Please disable it by passing `--no-mmap` to the `/bin/main` if faced with the issue.
+- Upon execution, verify the selected device(s) ID(s) in the output log, which can for instance be displayed as follow:

-
-5. Verify the device ID in output
-
-Verify to see if the selected GPU is shown in the output, like:
-
-```
+```sh
 detect 1 SYCL GPUs: [0] with top Max compute units:512
 ```
 Or
-```
+```sh
 use 1 SYCL GPUs: [0] with Max compute units:512
 ```

-
 ## Windows

-### Setup Environment
+### I. Setup Environment

-1. Install Intel GPU driver.
+1. Install GPU driver

-Please install Intel GPU driver by official guide: [Install GPU Drivers](https://www.intel.com/content/www/us/en/products/docs/discrete-gpus/arc/software/drivers.html).
+Intel GPU drivers instructions guide and download page can be found here: [Get intel GPU Drivers](https://www.intel.com/content/www/us/en/products/docs/discrete-gpus/arc/software/drivers.html).

-Note: **The driver is mandatory for compute function**.
+2. Install Visual Studio

-2. Install Visual Studio.
+If you already have a recent version of Microsoft Visual Studio, you can skip this step. Otherwise, please refer to the official download page for [Microsoft Visual Studio](https://visualstudio.microsoft.com/).

-Please install [Visual Studio](https://visualstudio.microsoft.com/) which impact oneAPI environment enabling in Windows.
+3. Install Intel® oneAPI Base toolkit

-3. Install Intel® oneAPI Base toolkit.
+The base toolkit can be obtained from the official [Intel® oneAPI Base Toolkit](https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit.html) page.

-a. Please follow the procedure in [Get the Intel® oneAPI Base Toolkit ](https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit.html).
+Please follow the instructions for downloading and installing the Toolkit for Windows, and preferably keep the default installation values unchanged, notably the installation path *(`C:\Program Files (x86)\Intel\oneAPI` by default)*.

-Recommend to install to default folder: **C:\Program Files (x86)\Intel\oneAPI**.
-
-Following guide uses the default folder as example. If you use other folder, please modify the following guide info with your folder.
+Following guidelines/code snippets assume the default installation values. Otherwise, please make sure the necessary changes are reflected where applicable.

 b. Enable oneAPI running environment:

- In Search, input 'oneAPI'.
+- Type "oneAPI" in the search bar, then open the `Intel oneAPI command prompt for Intel 64 for Visual Studio 2022` App.

-Search & open "Intel oneAPI command prompt for Intel 64 for Visual Studio 2022"
-
- In Run:
-
-In CMD:
+- On the command prompt, enable the runtime environment with the following:
 ```
 "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" intel64
 ```

-c. Check GPU
+c. Verify installation

-In oneAPI command line:
+In the oneAPI command line, run the following to print the available SYCL devices:

 ```
 sycl-ls
 ```

-There should be one or more level-zero devices. Please confirm that at least one GPU is present, like **[ext_oneapi_level_zero:gpu:0]**.
+There should be one or more *level-zero* GPU devices displayed as **[ext_oneapi_level_zero:gpu]**. Below is example of such output detecting an *intel Iris Xe* GPU as a Level-zero SYCL device:

 Output (example):
 ```
@@ -380,7 +416,7 @@ Output (example):
 [ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Iris(R) Xe Graphics 1.3 [1.3.28044]
 ```

-4. Install cmake & make
+4. Install build tools

 a. Download & install cmake for Windows: https://cmake.org/download/

@@ -390,76 +426,53 @@ b. Download & install mingw-w64 make for Windows provided by w64devkit

 - Extract `w64devkit` on your pc.

- Add the **bin** folder path in the Windows system PATH environment, like `C:\xxx\w64devkit\bin\`.
+- Add the **bin** folder path in the Windows system PATH environment (for e.g. `C:\xxx\w64devkit\bin\`).

-### Build locally:
+### II. Build llama.cpp

-In oneAPI command line window:
+On the oneAPI command line window, step into the llama.cpp main directory and run the following:

 ```
 mkdir -p build
 cd build
@call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" intel64 --force

-::  for FP16
-::  faster for long-prompt inference
-::  cmake -G "MinGW Makefiles" ..  -DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icx  -DCMAKE_BUILD_TYPE=Release -DLLAMA_SYCL_F16=ON
+cmake -G "MinGW Makefiles" ..  -DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icx  -DCMAKE_BUILD_TYPE=Release -DLLAMA_SYCL_F16=ON

-::  for FP32
-cmake -G "MinGW Makefiles" ..  -DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icx  -DCMAKE_BUILD_TYPE=Release
-
-
-::  build example/main only
-::  make main
-
-::  build all binary
-make -j
-cd ..
+make
 ```

-or
-
-```
+Otherwise, run the `win-build-sycl.bat` wrapper which encapsulates the former instructions:
+```sh
 .\examples\sycl\win-build-sycl.bat
 ```

-Note:
+*Notes:*

- By default, it will build for all binary files. It will take more time. To reduce the time, we recommend to build for **example/main** only.
+- By default, calling `make` will build all target binary files. In case of a minimal experimental setup, the user can build the inference executable only through `make main`.

-### Run
+### III. Run the inference

-1. Put model file to folder **models**
+1. Retrieve and prepare model

-You could download [llama-2-7b.Q4_0.gguf](https://huggingface.co/TheBloke/Llama-2-7B-GGUF/blob/main/llama-2-7b.Q4_0.gguf) as example.
+You can refer to the general [*Prepare and Quantize*](README#prepare-and-quantize) guide for model prepration, or simply download [llama-2-7b.Q4_0.gguf](https://huggingface.co/TheBloke/Llama-2-7B-GGUF/blob/main/llama-2-7b.Q4_0.gguf) model as example.

 2. Enable oneAPI running environment

- In Search, input 'oneAPI'.
-
-Search & open "Intel oneAPI command prompt for Intel 64 for Visual Studio 2022"
-
- In Run:
-
-In CMD:
+On the oneAPI command line window, run the following and step into the llama.cpp directory:
 ```
 "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" intel64
 ```

-3. List device ID
+3. List devices information

-Run without parameter:
+Similar to the native `sycl-ls`, available SYCL devices can be queried as follow:

 ```
 build\bin\ls-sycl-device.exe
-
-or
-
-build\bin\main.exe
 ```

-Check the ID in startup log, like:
-
+The output of this command in a system with 1 *intel CPU* and 1 *intel GPU* would look like the following:
 ```
 found 6 SYCL devices:
 |  |                  |                                             |Compute   |Max compute|Max work|Max sub|               |
@@ -480,7 +493,7 @@ found 6 SYCL devices:
 |compute capability 3.0|OpenCL running time, slower than level-zero in most cases|


-4. Device selection and execution of llama.cpp
+4. Launch inference

 There are two device selection modes:

@@ -505,7 +518,7 @@ build\bin\main.exe -m models\llama-2-7b.Q4_0.gguf -p "Building a website can be
 ```
 build\bin\main.exe -m models\llama-2-7b.Q4_0.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e -ngl 33 -s 0 -sm layer
 ```
-or run by script:
+Otherwise, run the following wrapper script:

 ```
 .\examples\sycl\win-run-llama2.bat
@@ -513,19 +526,14 @@ or run by script:

 Note:

- By default, mmap is used to read model file. In some cases, it leads to the hang issue. Recommend to use parameter **--no-mmap** to disable mmap() to skip this issue.
+- By default, `mmap` is used to read the model file. In some cases, it causes runtime hang issues. Please disable it by passing `--no-mmap` to the `main.exe` if faced with the issue.
+- Upon execution, verify the selected device(s) ID(s) in the output log, which can for instance be displayed as follow:

-
-
-5. Verify the device ID in output
-
-Verify to see if the selected GPU is shown in the output, like:
-
-```
+```sh
 detect 1 SYCL GPUs: [0] with top Max compute units:512
 ```
 Or
-```
+```sh
 use 1 SYCL GPUs: [0] with Max compute units:512
 ```

@@ -535,64 +543,54 @@ use 1 SYCL GPUs: [0] with Max compute units:512

 |Name|Value|Function|
 |-|-|-|
-|LLAMA_SYCL|ON (mandatory)|Enable build with SYCL code path. <br>For FP32/FP16, LLAMA_SYCL=ON is mandatory.|
-|LLAMA_SYCL_F16|ON (optional)|Enable FP16 build with SYCL code path. Faster for long-prompt inference. <br>For FP32, not set it.|
-|CMAKE_C_COMPILER|icx|Use icx compiler for SYCL code path|
-|CMAKE_CXX_COMPILER|icpx (Linux), icx (Windows)|use icpx/icx for SYCL code path|
-
-#### Running
+|LLAMA_SYCL|ON (mandatory)|Enable build with SYCL code path.|
+|LLAMA_SYCL_TARGET | INTEL *(default)* \| NVIDIA|Set the SYCL target device type.|
+|LLAMA_SYCL_F16|OFF *(default)* \|ON *(optional)*|Enable FP16 build with SYCL code path.|
+|CMAKE_C_COMPILER|icx|Set *icx* compiler for SYCL code path.|
+|CMAKE_CXX_COMPILER|icpx *(Linux)*, icx *(Windows)*|Set `icpx/icx` compiler for SYCL code path.|

+#### Runtime

 |Name|Value|Function|
 |-|-|-|
 |GGML_SYCL_DEBUG|0 (default) or 1|Enable log function by macro: GGML_SYCL_DEBUG|
 |ZES_ENABLE_SYSMAN| 0 (default) or 1|Support to get free memory of GPU by sycl::aspect::ext_intel_free_memory.<br>Recommended to use when --split-mode = layer|

-## Known Issue
+## Known Issues

- Hang during startup
+- Hanging during startup

-  llama.cpp use mmap as default way to read model file and copy to GPU. In some system, memcpy will be abnormal and block.
+  llama.cpp uses *mmap* as the default mode for reading the model file and copying it to the GPU. In some systems, `memcpy` might behave abnormally and therefore hang.

-  Solution: add **--no-mmap** or **--mmap 0**.
+  - **Solution**: add `--no-mmap` or `--mmap 0` flag to the `main` executable.

- Split-mode: [row] is not supported
-
-  It's on developing.
+- `Split-mode:[row]` is not supported.

 ## Q&A

-Note: please add prefix **[SYCL]** in issue title, so that we will check it as soon as possible.
-
-
 - Error:  `error while loading shared libraries: libsycl.so.7: cannot open shared object file: No such file or directory`.

-  Miss to enable oneAPI running environment.
+  - Potential cause: Unavailable oneAPI installation or not set ENV variables.
+  - Solution: Install *oneAPI base toolkit* and enable its ENV through: `source /opt/intel/oneapi/setvars.sh`.

-  Install oneAPI base toolkit and enable it by: `source /opt/intel/oneapi/setvars.sh`.
+- General compiler error:

- In Windows, no result, not error.
+  - Remove build folder or try a clean-build.

-  Miss to enable oneAPI running environment.
+- I can **not** see `[ext_oneapi_level_zero:gpu]` afer installing the GPU driver on Linux.

- Meet compile error.
+  Please double-check with `sudo sycl-ls`.

-  Remove folder **build** and try again.
-
- I can **not** see **[ext_oneapi_level_zero:gpu:0]** afer install GPU driver in Linux.
-
-  Please run **sudo sycl-ls**.
-
-  If you see it in result, please add video/render group to your ID:
+  If it's present in the list, please add video/render group to your user then **logout/login** or restart your system:

  ```
-  sudo usermod -aG render username
-  sudo usermod -aG video username
+  sudo usermod -aG render $USER
+  sudo usermod -aG video $USER
  ```
+  Otherwise, please double-check the GPU driver installation steps.

-  Then **relogin**.
-
-  If you do not see it, please check the installation GPU steps again.
+### **GitHub contribution**:
+Please add the **[SYCL]** prefix/tag in issues/PRs titles to help the SYCL-team check/address them without delay.

 ## Todo

--- a/README.md
+++ b/README.md
@@ -115,6 +115,7 @@ Typically finetunes of the base models below are supported as well.
 - [x] [CodeShell](https://github.com/WisdomShell/codeshell)
 - [x] [Gemma](https://ai.google.dev/gemma)
 - [x] [Mamba](https://github.com/state-spaces/mamba)
+- [x] [Xverse](https://huggingface.co/models?search=xverse)
 - [x] [Command-R](https://huggingface.co/CohereForAI/c4ai-command-r-v01)

 **Multimodal models:**
@@ -175,6 +176,9 @@ Unless otherwise noted these projects are open-source with permissive licensing:
 - [Mobile-Artificial-Intelligence/maid](https://github.com/Mobile-Artificial-Intelligence/maid) (MIT)
 - [Msty](https://msty.app) (proprietary)
 - [LLMFarm](https://github.com/guinmoon/LLMFarm?tab=readme-ov-file) (MIT)
+- [KanTV](https://github.com/zhouwg/kantv?tab=readme-ov-file)(Apachev2.0 or later)
+
+*(to have a project listed here, it should clearly state that it depends on `llama.cpp`)*

 ---

@@ -632,15 +636,6 @@ Building the program with BLAS support may lead to some performance improvements

 - #### Vulkan

-> [!WARNING]
->
-> Vulkan support has been broken in https://github.com/ggerganov/llama.cpp/pull/6122
-> due to relying on `GGML_OP_GET_ROWS` which is not yet properly supported by the Vulkan backend,
-> but should be fixed relatively soon (possibly in https://github.com/ggerganov/llama.cpp/pull/6155
-> (ref: https://github.com/ggerganov/llama.cpp/pull/6122#issuecomment-2015327635)).
->
-> Meanwhile, if you want to use the Vulkan backend, you should use the commit right before the breaking change, https://github.com/ggerganov/llama.cpp/commit/55c1b2a3bbd470e9e2a3a0618b92cf64a885f806
-
  **With docker**:

  You don't need to install Vulkan SDK. It will be installed inside the container.
--- a/ci/run.sh
+++ b/ci/run.sh
@@ -575,7 +575,7 @@ function gg_run_embd_bge_small {
    cd ${SRC}

    gg_wget models-mnt/bge-small/ https://huggingface.co/BAAI/bge-small-en-v1.5/raw/main/config.json
-    gg_wget models-mnt/bge-small/ https://huggingface.co/BAAI/bge-small-en-v1.5/resolve/main/tokenizer.model
+    gg_wget models-mnt/bge-small/ https://huggingface.co/BAAI/bge-small-en-v1.5/raw/main/tokenizer.json
    gg_wget models-mnt/bge-small/ https://huggingface.co/BAAI/bge-small-en-v1.5/raw/main/tokenizer_config.json
    gg_wget models-mnt/bge-small/ https://huggingface.co/BAAI/bge-small-en-v1.5/raw/main/special_tokens_map.json
    gg_wget models-mnt/bge-small/ https://huggingface.co/BAAI/bge-small-en-v1.5/resolve/main/pytorch_model.bin
--- a/convert-hf-to-gguf.py
+++ b/convert-hf-to-gguf.py
@@ -23,7 +23,7 @@ if 'NO_LOCAL_GGUF' not in os.environ:
    sys.path.insert(1, str(Path(__file__).parent / 'gguf-py'))
 import gguf

-from convert import HfVocab
+from convert import LlamaHfVocab, permute


 ###### MODEL DEFINITIONS ######
@@ -230,7 +230,7 @@ class Model(ABC):
    def _set_vocab_gpt2(self):
        dir_model = self.dir_model
        hparams = self.hparams
-        tokens: list[bytearray] = []
+        tokens: list[str] = []
        toktypes: list[int] = []

        from transformers import AutoTokenizer
@@ -243,8 +243,7 @@ class Model(ABC):

        for i in range(vocab_size):
            if i not in reverse_vocab:
-                pad_token = f"[PAD{i}]".encode('utf-8')
-                tokens.append(bytearray(pad_token))
+                tokens.append(f"[PAD{i}]")
                toktypes.append(gguf.TokenType.USER_DEFINED)
            elif reverse_vocab[i] in added_vocab:
                tokens.append(reverse_vocab[i])
@@ -266,7 +265,7 @@ class Model(ABC):
    def _set_vocab_qwen(self):
        dir_model = self.dir_model
        hparams = self.hparams
-        tokens: list[bytearray] = []
+        tokens: list[str] = []
        toktypes: list[int] = []

        from transformers import AutoTokenizer
@@ -291,8 +290,7 @@ class Model(ABC):

        for i in range(vocab_size):
            if i not in reverse_vocab:
-                pad_token = f"[PAD{i}]".encode("utf-8")
-                tokens.append(bytearray(pad_token))
+                tokens.append(f"[PAD{i}]")
                toktypes.append(gguf.TokenType.USER_DEFINED)
            elif reverse_vocab[i] in added_vocab:
                tokens.append(reverse_vocab[i])
@@ -372,12 +370,8 @@ class Model(ABC):
        special_vocab = gguf.SpecialVocab(self.dir_model, n_vocab=len(tokens))
        special_vocab.add_to_gguf(self.gguf_writer)

-    def _set_vocab_hf(self):
-        path = self.dir_model
-        added_tokens_path = self.dir_model
-        vocab = HfVocab(
-            path, added_tokens_path if added_tokens_path.exists() else None
-        )
+    def _set_vocab_llama_hf(self):
+        vocab = LlamaHfVocab(self.dir_model)
        tokens = []
        scores = []
        toktypes = []
@@ -779,6 +773,148 @@ class BaichuanModel(Model):
        return weights[r * n_part:r * n_part + r, ...]


+@Model.register("XverseForCausalLM")
+class XverseModel(Model):
+    model_arch = gguf.MODEL_ARCH.XVERSE
+
+    def set_vocab(self):
+        assert (self.dir_model / "tokenizer.json").is_file()
+        dir_model = self.dir_model
+        hparams = self.hparams
+
+        tokens: list[bytearray] = []
+        toktypes: list[int] = []
+
+        from transformers import AutoTokenizer
+        tokenizer = AutoTokenizer.from_pretrained(dir_model)
+        vocab_size = hparams.get("vocab_size", len(tokenizer.vocab))
+        assert max(tokenizer.vocab.values()) < vocab_size
+
+        reverse_vocab = {id_: encoded_tok for encoded_tok, id_ in tokenizer.vocab.items()}
+        added_vocab = tokenizer.get_added_vocab()
+
+        for token_id in range(vocab_size):
+            token_text = reverse_vocab[token_id].encode('utf-8')
+            # replace "\x00" to string with length > 0
+            if token_text == b"\x00":
+                toktype = gguf.TokenType.BYTE  # special
+                token_text = f"<{token_text}>".encode('utf-8')
+            elif re.fullmatch(br"<0x[0-9A-Fa-f]{2}>", token_text):
+                toktype = gguf.TokenType.BYTE  # special
+            elif reverse_vocab[token_id] in added_vocab:
+                if tokenizer.added_tokens_decoder[token_id].special:
+                    toktype = gguf.TokenType.CONTROL
+                else:
+                    toktype = gguf.TokenType.USER_DEFINED
+            else:
+                toktype = gguf.TokenType.NORMAL
+
+            tokens.append(token_text)
+            toktypes.append(toktype)
+
+        self.gguf_writer.add_tokenizer_model("llama")
+        self.gguf_writer.add_token_list(tokens)
+        self.gguf_writer.add_token_types(toktypes)
+
+        special_vocab = gguf.SpecialVocab(dir_model, n_vocab=len(tokens))
+        special_vocab.add_to_gguf(self.gguf_writer)
+
+    def set_gguf_parameters(self):
+        block_count = self.hparams["num_hidden_layers"]
+        head_count = self.hparams["num_attention_heads"]
+        head_count_kv = self.hparams.get("num_key_value_heads", head_count)
+        hf_repo = self.hparams.get("_name_or_path", "")
+
+        ctx_length = 0
+        if "max_sequence_length" in self.hparams:
+            ctx_length = self.hparams["max_sequence_length"]
+        elif "max_position_embeddings" in self.hparams:
+            ctx_length = self.hparams["max_position_embeddings"]
+        elif "model_max_length" in self.hparams:
+            ctx_length = self.hparams["model_max_length"]
+        else:
+            print("gguf: can not find ctx length parameter.")
+            sys.exit()
+
+        self.gguf_writer.add_name(self.dir_model.name)
+        self.gguf_writer.add_source_hf_repo(hf_repo)
+        self.gguf_writer.add_tensor_data_layout("Meta AI original pth")
+        self.gguf_writer.add_context_length(ctx_length)
+        self.gguf_writer.add_embedding_length(self.hparams["hidden_size"])
+        self.gguf_writer.add_block_count(block_count)
+        self.gguf_writer.add_feed_forward_length(self.hparams["intermediate_size"])
+        self.gguf_writer.add_rope_dimension_count(self.hparams["hidden_size"] // self.hparams["num_attention_heads"])
+        self.gguf_writer.add_head_count(head_count)
+        self.gguf_writer.add_head_count_kv(head_count_kv)
+        self.gguf_writer.add_layer_norm_rms_eps(self.hparams["rms_norm_eps"])
+
+        if self.hparams.get("rope_scaling") is not None and "factor" in self.hparams["rope_scaling"]:
+            if self.hparams["rope_scaling"].get("type") == "linear":
+                self.gguf_writer.add_rope_scaling_type(gguf.RopeScalingType.LINEAR)
+                self.gguf_writer.add_rope_scaling_factor(self.hparams["rope_scaling"]["factor"])
+
+    def write_tensors(self):
+        # Collect tensors from generator object
+        model_kv = dict(self.get_tensors())
+        block_count = self.hparams["num_hidden_layers"]
+        head_count = self.hparams["num_attention_heads"]
+        tensor_map = gguf.get_tensor_name_map(self.model_arch, block_count)
+        head_count_kv = self.hparams.get("num_key_value_heads", head_count)
+
+        for name, data_torch in model_kv.items():
+            # we don't need these
+            if name.endswith(".rotary_emb.inv_freq"):
+                continue
+
+            old_dtype = data_torch.dtype
+
+            # convert any unsupported data types to float32
+            if data_torch.dtype not in (torch.float16, torch.float32):
+                data_torch = data_torch.to(torch.float32)
+
+            # HF models permute some of the tensors, so we need to undo that
+            if name.endswith(("q_proj.weight")):
+                data_torch = self._reverse_hf_permute(data_torch, head_count, head_count)
+            if name.endswith(("k_proj.weight")):
+                data_torch = self._reverse_hf_permute(data_torch, head_count, head_count_kv)
+
+            data = data_torch.squeeze().numpy()
+
+            # map tensor names
+            new_name = tensor_map.get_name(name, try_suffixes=(".weight", ".bias"))
+            if new_name is None:
+                print(f"Can not map tensor {name!r}")
+                sys.exit()
+
+            n_dims = len(data.shape)
+            data_dtype = data.dtype
+
+            # if f32 desired, convert any float16 to float32
+            if self.ftype == 0 and data_dtype == np.float16:
+                data = data.astype(np.float32)
+
+            # TODO: Why cant we use these float16 as-is? There should be not reason to store float16 as float32
+            if self.ftype == 1 and data_dtype == np.float16 and n_dims == 1:
+                data = data.astype(np.float32)
+
+            # if f16 desired, convert any float32 2-dim weight tensors to float16
+            if self.ftype == 1 and data_dtype == np.float32 and name.endswith(".weight") and n_dims == 2:
+                data = data.astype(np.float16)
+
+            print(f"{name} -> {new_name}, n_dims = {n_dims}, {old_dtype} --> {data.dtype}")
+            self.gguf_writer.add_tensor(new_name, data)
+
+    def _reverse_hf_permute(self, weights: Tensor, n_head: int, n_kv_head: int | None = None) -> Tensor:
+        if n_kv_head is not None and n_head != n_kv_head:
+            n_head //= n_kv_head
+
+        return (
+            weights.reshape(n_head, 2, weights.shape[0] // n_head // 2, *weights.shape[1:])
+            .swapaxes(1, 2)
+            .reshape(weights.shape)
+        )
+
+
@Model.register("FalconForCausalLM", "RWForCausalLM")
 class FalconModel(Model):
    model_arch = gguf.MODEL_ARCH.FALCON
@@ -1058,12 +1194,72 @@ class StableLMModel(Model):
        self.gguf_writer.add_layer_norm_eps(self.find_hparam(["layer_norm_eps", "norm_eps"]))


-@Model.register("MixtralForCausalLM")
-class MixtralModel(Model):
+@Model.register("LlamaForCausalLM", "MistralForCausalLM", "MixtralForCausalLM")
+class LlamaModel(Model):
    model_arch = gguf.MODEL_ARCH.LLAMA

    def set_vocab(self):
-        self._set_vocab_sentencepiece()
+        try:
+            self. _set_vocab_sentencepiece()
+        except FileNotFoundError:
+            self._set_vocab_llama_hf()
+
+    def set_gguf_parameters(self):
+        super().set_gguf_parameters()
+        hparams = self.hparams
+        self.gguf_writer.add_vocab_size(hparams["vocab_size"])
+        self.gguf_writer.add_rope_dimension_count(hparams["hidden_size"] // hparams["num_attention_heads"])
+
+    # Same as super class, but permuting q_proj, k_proj
+    def write_tensors(self):
+        block_count = self.hparams.get("n_layers", self.hparams.get("num_hidden_layers", self.hparams.get("n_layer")))
+        tensor_map = gguf.get_tensor_name_map(self.model_arch, block_count)
+        n_head = self.hparams.get("num_attention_heads")
+        n_kv_head = self.hparams.get("num_key_value_heads")
+        for name, data_torch in self.get_tensors():
+            # we don't need these
+            if name.endswith((".attention.masked_bias", ".attention.bias", ".attention.rotary_emb.inv_freq")):
+                continue
+
+            old_dtype = data_torch.dtype
+
+            # convert any unsupported data types to float32
+            if data_torch.dtype not in (torch.float16, torch.float32):
+                data_torch = data_torch.to(torch.float32)
+
+            data = data_torch.numpy()
+
+            if name.endswith("q_proj.weight"):
+                data = permute(data, n_head, n_head)
+            if name.endswith("k_proj.weight"):
+                data = permute(data, n_head, n_kv_head)
+
+            data = data.squeeze()
+
+            # map tensor names
+            new_name = tensor_map.get_name(name, try_suffixes=(".weight", ".bias"))
+            if new_name is None:
+                print(f"Can not map tensor {name!r}")
+                sys.exit()
+
+            n_dims = len(data.shape)
+            data_dtype = data.dtype
+
+            # if f32 desired, convert any float16 to float32
+            if self.ftype == 0 and data_dtype == np.float16:
+                data = data.astype(np.float32)
+
+            # TODO: Why cant we use these float16 as-is? There should be not reason to store float16 as float32
+            if self.ftype == 1 and data_dtype == np.float16 and n_dims == 1:
+                data = data.astype(np.float32)
+
+            # if f16 desired, convert any float32 2-dim weight tensors to float16
+            if self.ftype == 1 and data_dtype == np.float32 and name.endswith(".weight") and n_dims == 2:
+                data = data.astype(np.float16)
+
+            print(f"{new_name}, n_dims = {n_dims}, {old_dtype} --> {data.dtype}")
+
+            self.gguf_writer.add_tensor(new_name, data)


@Model.register("GrokForCausalLM")
@@ -1099,7 +1295,7 @@ class MiniCPMModel(Model):
        self.gguf_writer.add_file_type(self.ftype)

    def set_vocab(self):
-        self._set_vocab_hf()
+        self._set_vocab_llama_hf()

    def _reverse_hf_permute(self, weights: Tensor, n_head: int, n_kv_head: int | None = None) -> Tensor:
        if n_kv_head is not None and n_head != n_kv_head:
@@ -1700,11 +1896,8 @@ class BertModel(Model):
            self.gguf_writer.add_pooling_type(pooling_type)

    def set_vocab(self):
-        path = self.dir_model
-        added_tokens_path = self.dir_model if self.dir_model.exists() else None
-
        # use huggingface vocab to get all tokens
-        vocab = HfVocab(path, added_tokens_path)
+        vocab = LlamaHfVocab(self.dir_model, ignore_nonllama=True)
        tokens, scores, toktypes = zip(*vocab.all_tokens())
        assert len(tokens) == vocab.vocab_size
        self.vocab_size = vocab.vocab_size
--- a/convert-persimmon-to-gguf.py
+++ b/convert-persimmon-to-gguf.py
@@ -106,12 +106,12 @@ def main():
    tensor_map = gguf.get_tensor_name_map(arch, block_count)
    print(tensor_map)
    for name in tensors.keys():
-        data = tensors[name]
+        data_torch = tensors[name]
        if name.endswith(".self_attention.rotary_emb.inv_freq"):
            continue
-        old_dtype = data.dtype
+        old_dtype = data_torch.dtype
        # TODO: FP16 conversion produces garbage outputs. (Q8_0 does not, so..?)
-        data = data.to(torch.float32).squeeze().numpy()
+        data = data_torch.to(torch.float32).squeeze().numpy()
        new_name = tensor_map.get_name(name, try_suffixes = (".weight", ".bias"))
        if new_name is None:
            print("Can not map tensor '" + name + "'")
--- a/convert.py
+++ b/convert.py
@@ -16,13 +16,14 @@ import re
 import signal
 import struct
 import sys
+import textwrap
 import time
 import zipfile
-from abc import ABCMeta, abstractmethod
+from abc import ABC, abstractmethod
 from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
 from dataclasses import dataclass
 from pathlib import Path
-from typing import IO, TYPE_CHECKING, Any, Callable, Iterable, Literal, TypeVar
+from typing import TYPE_CHECKING, Any, Callable, ClassVar, IO, Iterable, Literal, Protocol, TypeVar, runtime_checkable

 import numpy as np
 from sentencepiece import SentencePieceProcessor
@@ -43,6 +44,9 @@ ARCH = gguf.MODEL_ARCH.LLAMA

 DEFAULT_CONCURRENCY = 8

+ADDED_TOKENS_FILE = 'added_tokens.json'
+FAST_TOKENIZER_FILE = 'tokenizer.json'
+
 #
 # data types
 #
@@ -188,8 +192,10 @@ class Params:
            n_layer = next(i for i in itertools.count() if f"layers.{i}.attention.wq.weight" not in model)

        if n_layer < 1:
-            raise Exception("failed to guess 'n_layer'. This model is unknown or unsupported.\n"
-                            "Suggestion: provide 'config.json' of the model in the same directory containing model files.")
+            msg = """\
+                failed to guess 'n_layer'. This model is unknown or unsupported.
+                Suggestion: provide 'config.json' of the model in the same directory containing model files."""
+            raise KeyError(textwrap.dedent(msg))

        n_head = n_embd // 128 # guessed
        n_mult = 256           # guessed
@@ -211,7 +217,8 @@ class Params:

    @staticmethod
    def loadHFTransformerJson(model: LazyModel, config_path: Path) -> Params:
-        config = json.load(open(config_path))
+        with open(config_path) as f:
+            config = json.load(f)

        rope_scaling_type = f_rope_scale = n_orig_ctx = rope_finetuned = None
        rope_scaling = config.get("rope_scaling")
@@ -233,8 +240,10 @@ class Params:
        elif "max_position_embeddings" in config:
            n_ctx = config["max_position_embeddings"]
        else:
-            raise Exception("failed to guess 'n_ctx'. This model is unknown or unsupported.\n"
-                            "Suggestion: provide 'config.json' of the model in the same directory containing model files.")
+            msg = """\
+                failed to guess 'n_ctx'. This model is unknown or unsupported.
+                Suggestion: provide 'config.json' of the model in the same directory containing model files."""
+            raise KeyError(textwrap.dedent(msg))

        n_experts      = None
        n_experts_used = None
@@ -265,7 +274,8 @@ class Params:
    # {"dim": 8192, "multiple_of": 4096, "ffn_dim_multiplier": 1.3, "n_heads": 64, "n_kv_heads": 8, "n_layers": 80, "norm_eps": 1e-05, "vocab_size": -1}
    @staticmethod
    def loadOriginalParamsJson(model: LazyModel, config_path: Path) -> Params:
-        config = json.load(open(config_path))
+        with open(config_path) as f:
+            config = json.load(f)

        n_experts      = None
        n_experts_used = None
@@ -331,47 +341,86 @@ class Params:
 # vocab
 #

-class BpeVocab:
+@runtime_checkable
+class BaseVocab(Protocol):
+    tokenizer_model: ClassVar[str]
+    name: ClassVar[str]
+
+
+class NoVocab(BaseVocab):
+    tokenizer_model = "no_vocab"
+    name = "no_vocab"
+
+    def __repr__(self) -> str:
+        return "<NoVocab for a model without integrated vocabulary>"
+
+
+@runtime_checkable
+class Vocab(BaseVocab, Protocol):
+    vocab_size: int
+    added_tokens_dict: dict[str, int]
+    added_tokens_list: list[str]
+    fname_tokenizer: Path
+
+    def __init__(self, base_path: Path): ...
+    def all_tokens(self) -> Iterable[tuple[bytes, float, gguf.TokenType]]: ...
+
+
+class BpeVocab(Vocab):
    tokenizer_model = "gpt2"
    name = "bpe"

-    def __init__(self, fname_tokenizer: Path, fname_added_tokens: Path | None) -> None:
-        self.bpe_tokenizer = json.loads(open(str(fname_tokenizer), encoding="utf-8").read())
-        if isinstance(self.bpe_tokenizer.get('model'), dict):
-            self.vocab = self.bpe_tokenizer["model"]["vocab"]
-        else:
-            self.vocab = self.bpe_tokenizer
-        added_tokens: dict[str, int]
-        if fname_added_tokens is not None:
-            # FIXME: Verify that added tokens here _cannot_ overlap with the main vocab.
-            added_tokens = json.load(open(fname_added_tokens, encoding="utf-8"))
-        else:
-            # Fall back to trying to find the added tokens in tokenizer.json
-            tokenizer_json_file = fname_tokenizer.parent / 'tokenizer.json'
-            if not tokenizer_json_file.is_file():
-                added_tokens = {}
-            else:
-                tokenizer_json = json.load(open(tokenizer_json_file, encoding="utf-8"))
-                added_tokens = dict(
-                    (item['content'], item['id'])
-                    for item in tokenizer_json.get('added_tokens', [])
-                    # Added tokens here can be duplicates of the main vocabulary.
-                    if item['content'] not in self.bpe_tokenizer)
+    def __init__(self, base_path: Path):
+        added_tokens: dict[str, int] = {}

-        vocab_size: int = len(self.vocab)
-        expected_ids    = list(range(vocab_size, vocab_size + len(added_tokens)))
-        actual_ids      = sorted(added_tokens.values())
+        if (fname_tokenizer := base_path / 'vocab.json').exists():
+            # "slow" tokenizer
+            with open(fname_tokenizer, encoding="utf-8") as f:
+                self.vocab = json.load(f)
+
+            try:
+                # FIXME: Verify that added tokens here _cannot_ overlap with the main vocab.
+                with open(base_path / ADDED_TOKENS_FILE, encoding="utf-8") as f:
+                    added_tokens = json.load(f)
+            except FileNotFoundError:
+                pass
+        else:
+            # "fast" tokenizer
+            fname_tokenizer = base_path / FAST_TOKENIZER_FILE
+
+            # if this fails, FileNotFoundError propagates to caller
+            with open(fname_tokenizer, encoding="utf-8") as f:
+                tokenizer_json = json.load(f)
+
+            tokenizer_model: dict[str, Any] = tokenizer_json['model']
+            if (
+                tokenizer_model['type'] != 'BPE' or tokenizer_model.get('byte_fallback', False)
+                or tokenizer_json['decoder']['type'] != 'ByteLevel'
+            ):
+                raise FileNotFoundError('Cannot find GPT-2 BPE tokenizer')
+
+            self.vocab = tokenizer_model["vocab"]
+
+            if (added := tokenizer_json.get('added_tokens')) is not None:
+                # Added tokens here can be duplicates of the main vocabulary.
+                added_tokens = {item['content']: item['id']
+                                for item in added
+                                if item['content'] not in self.vocab}
+
+        vocab_size   = len(self.vocab)
+        expected_ids = list(range(vocab_size, vocab_size + len(added_tokens)))
+        actual_ids   = sorted(added_tokens.values())
        if expected_ids != actual_ids:
            expected_end_id = vocab_size + len(actual_ids) - 1
-            raise Exception(f"Expected the {len(actual_ids)} added token ID(s) to be sequential in the range {vocab_size} - {expected_end_id}; got {actual_ids}")
+            raise ValueError(f"Expected the {len(actual_ids)} added token ID(s) to be sequential in the range "
+                             f"{vocab_size} - {expected_end_id}; got {actual_ids}")

        items = sorted(added_tokens.items(), key=lambda text_idx: text_idx[1])
        self.added_tokens_dict    = added_tokens
        self.added_tokens_list    = [text for (text, idx) in items]
-        self.vocab_size_base: int = vocab_size
-        self.vocab_size: int      = self.vocab_size_base + len(self.added_tokens_list)
+        self.vocab_size_base      = vocab_size
+        self.vocab_size           = self.vocab_size_base + len(self.added_tokens_list)
        self.fname_tokenizer      = fname_tokenizer
-        self.fname_added_tokens   = fname_added_tokens

    def bpe_tokens(self) -> Iterable[tuple[bytes, float, gguf.TokenType]]:
        reverse_vocab = {id: encoded_tok for encoded_tok, id in self.vocab.items()}
@@ -392,19 +441,25 @@ class BpeVocab:
        return f"<BpeVocab with {self.vocab_size_base} base tokens and {len(self.added_tokens_list)} added tokens>"


-class SentencePieceVocab:
+class SentencePieceVocab(Vocab):
    tokenizer_model = "llama"
    name = "spm"

-    def __init__(self, fname_tokenizer: Path, fname_added_tokens: Path | None) -> None:
-        self.sentencepiece_tokenizer = SentencePieceProcessor(str(fname_tokenizer))
-        added_tokens: dict[str, int]
-        if fname_added_tokens is not None:
-            added_tokens = json.load(open(fname_added_tokens, encoding="utf-8"))
-        else:
-            added_tokens = {}
+    def __init__(self, base_path: Path):
+        added_tokens: dict[str, int] = {}
+        if (fname_tokenizer := base_path / 'tokenizer.model').exists():
+            # normal location
+            try:
+                with open(base_path / ADDED_TOKENS_FILE, encoding="utf-8") as f:
+                    added_tokens = json.load(f)
+            except FileNotFoundError:
+                pass
+        elif not (fname_tokenizer := base_path.parent / 'tokenizer.model').exists():
+            # not found in alternate location either
+            raise FileNotFoundError('Cannot find tokenizer.model')

-        vocab_size: int = self.sentencepiece_tokenizer.vocab_size()
+        self.sentencepiece_tokenizer = SentencePieceProcessor(str(fname_tokenizer))
+        vocab_size = self.sentencepiece_tokenizer.vocab_size()

        new_tokens       = {id: piece for piece, id in added_tokens.items() if id >= vocab_size}
        expected_new_ids = list(range(vocab_size, vocab_size + len(new_tokens)))
@@ -414,18 +469,17 @@ class SentencePieceVocab:
            raise ValueError(f"Expected new token IDs {expected_new_ids} to be sequential; got {actual_new_ids}")

        # Token pieces that were added to the base vocabulary.
-        self.added_tokens_dict = added_tokens
+        self.added_tokens_dict  = added_tokens
        self.added_tokens_list  = [new_tokens[id] for id in actual_new_ids]
        self.vocab_size_base    = vocab_size
        self.vocab_size         = self.vocab_size_base + len(self.added_tokens_list)
        self.fname_tokenizer    = fname_tokenizer
-        self.fname_added_tokens = fname_added_tokens

    def sentencepiece_tokens(self) -> Iterable[tuple[bytes, float, gguf.TokenType]]:
        tokenizer = self.sentencepiece_tokenizer
        for i in range(tokenizer.vocab_size()):
            piece = tokenizer.id_to_piece(i)
-            text: bytes = piece.encode("utf-8")
+            text         = piece.encode("utf-8")
            score: float = tokenizer.get_score(i)

            toktype = gguf.TokenType.NORMAL
@@ -458,27 +512,42 @@ class SentencePieceVocab:
        return f"<SentencePieceVocab with {self.vocab_size_base} base tokens and {len(self.added_tokens_list)} added tokens>"


-class HfVocab:
+class LlamaHfVocab(Vocab):
    tokenizer_model = "llama"
    name = "hfft"

-    def __init__(self, fname_tokenizer: Path, fname_added_tokens: Path | None = None) -> None:
+    def __init__(self, base_path: Path, ignore_nonllama: bool = False):
+        fname_tokenizer = base_path / FAST_TOKENIZER_FILE
+        # if this fails, FileNotFoundError propagates to caller
+        with open(fname_tokenizer, encoding='utf-8') as f:
+            tokenizer_json = json.load(f)
+
+        # pre-check so we know if we need transformers
+        tokenizer_model: dict[str, Any] = tokenizer_json['model']
+        if ignore_nonllama:
+            pass  # workaround incorrect use of this class for WordPiece
+        elif (
+            tokenizer_model['type'] != 'BPE' or not tokenizer_model.get('byte_fallback', False)
+            or tokenizer_json['decoder']['type'] != 'Sequence'
+        ):
+            raise FileNotFoundError('Cannot find Llama BPE tokenizer')
+
        try:
            from transformers import AutoTokenizer
        except ImportError as e:
            raise ImportError(
-                "To use HfVocab, please install the `transformers` package. "
+                "To use LlamaHfVocab, please install the `transformers` package. "
                "You can install it with `pip install transformers`."
            ) from e

-        print("fname_tokenizer:", fname_tokenizer)
        # Allow the tokenizer to default to slow or fast versions.
        # Explicitly set tokenizer to use local paths.
        self.tokenizer = AutoTokenizer.from_pretrained(
-            fname_tokenizer,
-            cache_dir=fname_tokenizer,
+            base_path,
+            cache_dir=base_path,
            local_files_only=True,
        )
+        assert self.tokenizer.is_fast  # assume tokenizer.json is used

        # Initialize lists and dictionaries for added tokens
        self.added_tokens_list = []
@@ -506,8 +575,7 @@ class HfVocab:
        self.vocab_size_base = self.tokenizer.vocab_size
        self.vocab_size      = self.vocab_size_base + len(self.added_tokens_list)

-        self.fname_tokenizer    = fname_tokenizer
-        self.fname_added_tokens = fname_added_tokens
+        self.fname_tokenizer = fname_tokenizer

    def hf_tokens(self) -> Iterable[tuple[bytes, float, gguf.TokenType]]:
        reverse_vocab = {
@@ -559,18 +627,7 @@ class HfVocab:
        yield from self.added_tokens()

    def __repr__(self) -> str:
-        return f"<HfVocab with {self.vocab_size_base} base tokens and {len(self.added_tokens_list)} added tokens>"
-
-
-class NoVocab:
-    tokenizer_model = "no_vocab"
-    name = "no_vocab"
-
-    def __repr__(self) -> str:
-        return "<NoVocab for a model without integrated vocabulary>"
-
-
-Vocab: TypeAlias = "BpeVocab | SentencePieceVocab | HfVocab | NoVocab"
+        return f"<LlamaHfVocab with {self.vocab_size_base} base tokens and {len(self.added_tokens_list)} added tokens>"


 #
@@ -588,7 +645,7 @@ def permute(weights: NDArray, n_head: int, n_head_kv: int) -> NDArray:
            .reshape(weights.shape))


-class Tensor(metaclass=ABCMeta):
+class Tensor(ABC):
    data_type: DataType

    @abstractmethod
@@ -610,7 +667,7 @@ def bf16_to_fp32(bf16_arr: np.ndarray[Any, np.dtype[np.uint16]]) -> NDArray:


 class UnquantizedTensor(Tensor):
-    def __init__(self, ndarray: NDArray) -> None:
+    def __init__(self, ndarray: NDArray):
        assert isinstance(ndarray, np.ndarray)
        self.ndarray = ndarray
        self.data_type = NUMPY_TYPE_TO_DATA_TYPE[ndarray.dtype]
@@ -689,7 +746,7 @@ class ModelPlus:
    model: LazyModel
    paths: list[Path]  # Where this was read from.
    format: Literal['ggml', 'torch', 'safetensors', 'none']
-    vocab: Vocab | None  # For GGML models (which have vocab built in), the vocab.
+    vocab: BaseVocab | None  # For GGML models (which have vocab built in), the vocab.


 def merge_sharded(models: list[LazyModel]) -> LazyModel:
@@ -698,7 +755,7 @@ def merge_sharded(models: list[LazyModel]) -> LazyModel:
    names = {name: None for model in models for name in model}

    def convert(name: str) -> LazyTensor:
-        lazy_tensors: list[LazyTensor] = [model[name] for model in models]
+        lazy_tensors = [model[name] for model in models]
        if len(lazy_tensors) == 1:
            # only one file; don't go through this procedure since there might
            # be quantized tensors
@@ -719,7 +776,7 @@ def merge_sharded(models: list[LazyModel]) -> LazyModel:

        def load() -> UnquantizedTensor:
            ndarrays = [load_unquantized(tensor) for tensor in lazy_tensors]
-            concatenated: NDArray = np.concatenate(ndarrays, axis=axis)
+            concatenated = np.concatenate(ndarrays, axis=axis)
            return UnquantizedTensor(concatenated)
        description = 'concatenated[[' + '] | ['.join(lt.description for lt in lazy_tensors) + ']]'
        return LazyTensor(load, concatenated_shape, lazy_tensors[0].data_type, description)
@@ -807,10 +864,10 @@ class LazyUnpickler(pickle.Unpickler):

        def load(offset: int, elm_count: int) -> NDArray:
            dtype = data_type.dtype
-            fp = self.zip_file.open(info)
-            fp.seek(offset * dtype.itemsize)
-            size = elm_count * dtype.itemsize
-            data = fp.read(size)
+            with self.zip_file.open(info) as fp:
+                fp.seek(offset * dtype.itemsize)
+                size = elm_count * dtype.itemsize
+                data = fp.read(size)
            assert len(data) == size
            return np.frombuffer(data, dtype)
        description = f'storage data_type={data_type} path-in-zip={filename} path={self.zip_file.filename}'
@@ -831,7 +888,7 @@ class LazyUnpickler(pickle.Unpickler):
    def rebuild_from_type_v2(func, new_type, args, state):
        return func(*args)

-    CLASSES: dict[tuple[str, str], Any] = {
+    CLASSES = {
        # getattr used here as a workaround for mypy not being smart enough to determine
        # the staticmethods have a __func__ attribute.
        ('torch._tensor', '_rebuild_from_type_v2'): getattr(rebuild_from_type_v2, '__func__'),
@@ -890,7 +947,7 @@ def lazy_load_safetensors_file(fp: IO[bytes], path: Path) -> ModelPlus:
 def must_read(fp: IO[bytes], length: int) -> bytes:
    ret = fp.read(length)
    if len(ret) < length:
-        raise Exception("unexpectedly reached end of file")
+        raise EOFError("unexpectedly reached end of file")
    return ret


@@ -948,13 +1005,14 @@ def bounded_parallel_map(func: Callable[[In], Out], iterable: Iterable[In], conc
            yield result


-def check_vocab_size(params: Params, vocab: Vocab, pad_vocab: bool = False) -> None:
+def check_vocab_size(params: Params, vocab: BaseVocab, pad_vocab: bool = False) -> None:
    # Handle special case where the model's vocab size is not set
    if params.n_vocab == -1:
        raise ValueError(
-            f"The model's vocab size is set to -1 in params.json. Please update it manually.{f' Maybe {vocab.vocab_size}?' if hasattr(vocab, 'vocab_size') else ''}"
+            "The model's vocab size is set to -1 in params.json. Please update it manually."
+            + (f" Maybe {vocab.vocab_size}?" if isinstance(vocab, Vocab) else ""),
        )
-    if isinstance(vocab, NoVocab):
+    if not isinstance(vocab, Vocab):
        return  # model has no vocab

    # Check for a vocab size mismatch
@@ -979,11 +1037,11 @@ def check_vocab_size(params: Params, vocab: Vocab, pad_vocab: bool = False) -> N
    if vocab.vocab_size < params.n_vocab:
        msg += " Add the --pad-vocab option and try again."

-    raise Exception(msg)
+    raise ValueError(msg)


 class OutputFile:
-    def __init__(self, fname_out: Path, endianess:gguf.GGUFEndian = gguf.GGUFEndian.LITTLE) -> None:
+    def __init__(self, fname_out: Path, endianess:gguf.GGUFEndian = gguf.GGUFEndian.LITTLE):
        self.gguf = gguf.GGUFWriter(fname_out, gguf.MODEL_ARCH_NAMES[ARCH], endianess=endianess)

    def add_meta_arch(self, params: Params) -> None:
@@ -1034,8 +1092,6 @@ class OutputFile:
            self.gguf.add_file_type(params.ftype)

    def extract_vocabulary_from_model(self, vocab: Vocab) -> tuple[list[bytes], list[float], list[gguf.TokenType]]:
-        assert not isinstance(vocab, NoVocab)
-
        tokens = []
        scores = []
        toktypes = []
@@ -1135,7 +1191,7 @@ class OutputFile:

    @staticmethod
    def write_all(
-        fname_out: Path, ftype: GGMLFileType, params: Params, model: LazyModel, vocab: Vocab, svocab: gguf.SpecialVocab,
+        fname_out: Path, ftype: GGMLFileType, params: Params, model: LazyModel, vocab: BaseVocab, svocab: gguf.SpecialVocab,
        concurrency: int = DEFAULT_CONCURRENCY, endianess: gguf.GGUFEndian = gguf.GGUFEndian.LITTLE,
        pad_vocab: bool = False,
    ) -> None:
@@ -1145,11 +1201,11 @@ class OutputFile:

        # meta data
        of.add_meta_arch(params)
-        if isinstance(vocab, NoVocab):
-            of.gguf.add_tokenizer_model(vocab.tokenizer_model)
-        else:
+        if isinstance(vocab, Vocab):
            of.add_meta_vocab(vocab)
            of.add_meta_special_vocab(svocab)
+        else:  # NoVocab
+            of.gguf.add_tokenizer_model(vocab.tokenizer_model)

        # tensor info
        for name, lazy_tensor in model.items():
@@ -1176,7 +1232,7 @@ def pick_output_type(model: LazyModel, output_type_str: str | None) -> GGMLFileT

    name_to_type = {name: lazy_tensor.data_type for (name, lazy_tensor) in model.items()}

-    raise Exception(f"Unexpected combination of types: {name_to_type}")
+    raise ValueError(f"Unexpected combination of types: {name_to_type}")


 def convert_to_output_type(model: LazyModel, output_type: GGMLFileType) -> LazyModel:
@@ -1186,7 +1242,7 @@ def convert_to_output_type(model: LazyModel, output_type: GGMLFileType) -> LazyM

 def convert_model_names(model: LazyModel, params: Params, skip_unknown: bool) -> LazyModel:
    tmap = gguf.TensorNameMap(ARCH, params.n_layer)
-    should_skip: set[gguf.MODEL_TENSOR] = set(gguf.MODEL_TENSOR_SKIP.get(ARCH, []))
+    should_skip = set(gguf.MODEL_TENSOR_SKIP.get(ARCH, []))

    tmp = model

@@ -1213,8 +1269,7 @@ def convert_model_names(model: LazyModel, params: Params, skip_unknown: bool) ->
            if skip_unknown:
                print(f"Unexpected tensor name: {name} - skipping")
                continue
-            else:
-                raise Exception(f"Unexpected tensor name: {name}. Use --skip-unknown to ignore it (e.g. LLaVA)")
+            raise ValueError(f"Unexpected tensor name: {name}. Use --skip-unknown to ignore it (e.g. LLaVA)")

        if tensor_type in should_skip:
            print(f"skipping tensor {name_new}")
@@ -1231,7 +1286,7 @@ def nth_multifile_path(path: Path, n: int) -> Path | None:
    the nth path in the model.
    '''
    # Support the following patterns:
-    patterns: list[tuple[str, str]] = [
+    patterns = [
        # - x.00.pth, x.01.pth, etc.
        (r'\.[0-9]{2}\.pth$', f'.{n:02}.pth'),
        # - x-00001-of-00002.bin, x-00002-of-00002.bin, etc.
@@ -1277,9 +1332,9 @@ def load_some_model(path: Path) -> ModelPlus:
            globs = ["consolidated.00.pth", "pytorch_model-00001-of-*.bin", "*.pt", "pytorch_model.bin"]
            files = [file for glob in globs for file in path.glob(glob)]
        if not files:
-            raise Exception(f"Can't find model in directory {path}")
+            raise FileNotFoundError(f"Can't find model in directory {path}")
        if len(files) > 1:
-            raise Exception(f"Found multiple models in {path}, not sure which to pick: {files}")
+            raise ValueError(f"Found multiple models in {path}, not sure which to pick: {files}")
        path = files[0]

    paths = find_multifile_paths(path)
@@ -1293,36 +1348,14 @@ def load_some_model(path: Path) -> ModelPlus:


 class VocabFactory:
-    _FILES = {"spm": "tokenizer.model", "bpe": "vocab.json", "hfft": "tokenizer.json"}
+    _VOCAB_CLASSES: list[type[Vocab]] = [SentencePieceVocab, BpeVocab, LlamaHfVocab]

    def __init__(self, path: Path):
        self.path = path
-        self.file_paths = self._detect_files()
-        print(f"Found vocab files: {self.file_paths}")

-    def _detect_files(self) -> dict[str, Path | None]:
-        def locate(file: str) -> Path | None:
-            if (path := self.path / file).exists():
-                return path
-            if (path := self.path.parent / file).exists():
-                return path
-            return None
-
-        return {vt: locate(f) for vt, f in self._FILES.items()}
-
-    def _select_file(self, vocab_types: list[str]) -> tuple[str, Path]:
-        for vtype in vocab_types:
-            try:
-                path = self.file_paths[vtype]
-            except KeyError:
-                raise ValueError(f"Unsupported vocabulary type {vtype}") from None
-            if path is not None:
-                return vtype, path
-        raise FileNotFoundError(f"Could not find any of {[self._FILES[vt] for vt in vocab_types]}")
-
-    def _create_special_vocab(self, vocab: Vocab, model_parent_path: Path) -> gguf.SpecialVocab:
+    def _create_special_vocab(self, vocab: BaseVocab, model_parent_path: Path) -> gguf.SpecialVocab:
        load_merges = vocab.name == "bpe"
-        n_vocab = vocab.vocab_size if hasattr(vocab, "vocab_size") else None
+        n_vocab = vocab.vocab_size if isinstance(vocab, Vocab) else None
        return gguf.SpecialVocab(
            model_parent_path,
            load_merges=load_merges,
@@ -1331,27 +1364,29 @@ class VocabFactory:
        )

    def _create_vocab_by_path(self, vocab_types: list[str]) -> Vocab:
-        vocab_type, path = self._select_file(vocab_types)
-        print(f"Loading vocab file {path!r}, type {vocab_type!r}")
+        vocab_classes: dict[str, type[Vocab]] = {cls.name: cls for cls in self._VOCAB_CLASSES}
+        selected_vocabs: dict[str, type[Vocab]] = {}
+        for vtype in vocab_types:
+            try:
+                selected_vocabs[vtype] = vocab_classes[vtype]
+            except KeyError:
+                raise ValueError(f"Unsupported vocabulary type {vtype}") from None

-        added_tokens_path = path.parent / "added_tokens.json"
-        if vocab_type == "bpe":
-            return BpeVocab(
-                path, added_tokens_path if added_tokens_path.exists() else None
-            )
-        if vocab_type == "spm":
-            return SentencePieceVocab(
-                path, added_tokens_path if added_tokens_path.exists() else None
-            )
-        if vocab_type == "hfft":
-            return HfVocab(
-                path.parent, added_tokens_path if added_tokens_path.exists() else None
-            )
-        raise ValueError(vocab_type)
+        for vtype, cls in selected_vocabs.items():
+            try:
+                vocab = cls(self.path)
+                break
+            except FileNotFoundError:
+                pass  # ignore unavailable tokenizers
+        else:
+            raise FileNotFoundError(f"Could not find a tokenizer matching any of {vocab_types}")

-    def load_vocab(self, vocab_types: list[str], model_parent_path: Path) -> tuple[Vocab, gguf.SpecialVocab]:
-        vocab: Vocab
-        if len(vocab_types) == 1 and "no_vocab" in vocab_types:
+        print(f"Loaded vocab file {vocab.fname_tokenizer!r}, type {vocab.name!r}")
+        return vocab
+
+    def load_vocab(self, vocab_types: list[str] | None, model_parent_path: Path) -> tuple[BaseVocab, gguf.SpecialVocab]:
+        vocab: BaseVocab
+        if vocab_types is None:
            vocab = NoVocab()
        else:
            vocab = self._create_vocab_by_path(vocab_types)
@@ -1408,10 +1443,8 @@ def main(args_in: list[str] | None = None) -> None:
    parser.add_argument("--skip-unknown", action="store_true",    help="skip unknown tensor names instead of failing")

    args = parser.parse_args(args_in)
-    if args.no_vocab:
-        if args.vocab_only:
-            raise ValueError("no need to specify --vocab-only if using --no-vocab")
-        args.vocab_type = "no_vocab"
+    if args.no_vocab and args.vocab_only:
+        raise ValueError("--vocab-only does not make sense with --no-vocab")

    if args.dump_single:
        model_plus = lazy_load_file(args.model)
@@ -1433,10 +1466,12 @@ def main(args_in: list[str] | None = None) -> None:
    params = Params.load(model_plus)
    if params.n_ctx == -1:
        if args.ctx is None:
-            raise Exception("The model doesn't have a context size, and you didn't specify one with --ctx\n"
-                            "Please specify one with --ctx:\n"
-                            " - LLaMA v1: --ctx 2048\n"
-                            " - LLaMA v2: --ctx 4096\n")
+            msg = """\
+                The model doesn't have a context size, and you didn't specify one with --ctx
+                Please specify one with --ctx:
+                 - LLaMA v1: --ctx 2048
+                 - LLaMA v2: --ctx 4096"""
+            parser.error(textwrap.dedent(msg))
        params.n_ctx = args.ctx

    if args.outtype:
@@ -1451,9 +1486,11 @@ def main(args_in: list[str] | None = None) -> None:
    model_parent_path = model_plus.paths[0].parent
    vocab_path = Path(args.vocab_dir or args.model or model_parent_path)
    vocab_factory = VocabFactory(vocab_path)
-    vocab, special_vocab = vocab_factory.load_vocab(args.vocab_type.split(","), model_parent_path)
+    vocab_types = None if args.no_vocab else args.vocab_type.split(",")
+    vocab, special_vocab = vocab_factory.load_vocab(vocab_types, model_parent_path)

    if args.vocab_only:
+        assert isinstance(vocab, Vocab)
        if not args.outfile:
            raise ValueError("need --outfile if using --vocab-only")
        outfile = args.outfile
--- a/examples/gguf-split/gguf-split.cpp
+++ b/examples/gguf-split/gguf-split.cpp
@@ -28,9 +28,11 @@ enum split_operation : uint8_t {

 struct split_params {
    split_operation operation = SPLIT_OP_SPLIT;
+    size_t n_bytes_split = 0;
    int n_split_tensors = 128;
    std::string input;
    std::string output;
+    bool dry_run = false;
 };

 static void split_print_usage(const char * executable) {
@@ -41,15 +43,36 @@ static void split_print_usage(const char * executable) {
    printf("Apply a GGUF operation on IN to OUT.");
    printf("\n");
    printf("options:\n");
-    printf("  -h, --help            show this help message and exit\n");
-    printf("  --version             show version and build info\n");
-    printf("  --split               split GGUF to multiple GGUF (default)\n");
-    printf("  --split-max-tensors   max tensors in each split: default(%d)\n", default_params.n_split_tensors);
-    printf("  --merge               merge multiple GGUF to a single GGUF\n");
+    printf("  -h, --help              show this help message and exit\n");
+    printf("  --version               show version and build info\n");
+    printf("  --split                 split GGUF to multiple GGUF (enabled by default)\n");
+    printf("  --merge                 merge multiple GGUF to a single GGUF\n");
+    printf("  --split-max-tensors     max tensors in each split (default: %d)\n", default_params.n_split_tensors);
+    printf("  --split-max-size N(M|G) max size per split\n");
+    printf("  --dry-run               only print out a split plan and exit, without writing any new files\n");
    printf("\n");
 }

-static bool split_params_parse_ex(int argc, const char ** argv, split_params & params) {
+// return convert string, for example "128M" or "4G" to number of bytes
+static size_t split_str_to_n_bytes(std::string str) {
+    size_t n_bytes = 0;
+    int n;
+    if (str.back() == 'M') {
+        sscanf(str.c_str(), "%d", &n);
+        n_bytes = n * 1024 * 1024; // megabytes
+    } else if (str.back() == 'G') {
+        sscanf(str.c_str(), "%d", &n);
+        n_bytes = n * 1024 * 1024 * 1024; // gigabytes
+    } else {
+        throw std::invalid_argument("error: supported units are M (megabytes) or G (gigabytes), but got: " + std::string(1, str.back()));
+    }
+    if (n <= 0) {
+        throw std::invalid_argument("error: size must be a positive value");
+    }
+    return n_bytes;
+}
+
+static void split_params_parse_ex(int argc, const char ** argv, split_params & params) {
    std::string arg;
    const std::string arg_prefix = "--";
    bool invalid_param = false;
@@ -62,6 +85,8 @@ static bool split_params_parse_ex(int argc, const char ** argv, split_params & p
        }

        bool arg_found = false;
+        bool is_op_set = false;
+        bool is_mode_set = false;
        if (arg == "-h" || arg == "--help") {
            split_print_usage(argv[0]);
            exit(0);
@@ -71,23 +96,46 @@ static bool split_params_parse_ex(int argc, const char ** argv, split_params & p
            fprintf(stderr, "built with %s for %s\n", LLAMA_COMPILER, LLAMA_BUILD_TARGET);
            exit(0);
        }
+        if (arg == "--dry-run") {
+            arg_found = true;
+            params.dry_run = true;
+        }

+        if (is_op_set) {
+            throw std::invalid_argument("error: either --split or --merge can be specified, but not both");
+        }
        if (arg == "--merge") {
            arg_found = true;
+            is_op_set = true;
            params.operation = SPLIT_OP_MERGE;
        }
        if (arg == "--split") {
            arg_found = true;
+            is_op_set = true;
            params.operation = SPLIT_OP_SPLIT;
        }
+
+        if (is_mode_set) {
+            throw std::invalid_argument("error: either --split-max-tensors or --split-max-size can be specified, but not both");
+        }
        if (arg == "--split-max-tensors") {
            if (++arg_idx >= argc) {
                invalid_param = true;
                break;
            }
            arg_found = true;
+            is_mode_set = true;
            params.n_split_tensors = atoi(argv[arg_idx]);
        }
+        if (arg == "--split-max-size") {
+            if (++arg_idx >= argc) {
+                invalid_param = true;
+                break;
+            }
+            arg_found = true;
+            is_mode_set = true;
+            params.n_bytes_split = split_str_to_n_bytes(argv[arg_idx]);
+        }

        if (!arg_found) {
            throw std::invalid_argument("error: unknown argument: " + arg);
@@ -99,24 +147,17 @@ static bool split_params_parse_ex(int argc, const char ** argv, split_params & p
    }

    if (argc - arg_idx < 2) {
-        printf("%s: bad arguments\n", argv[0]);
-        split_print_usage(argv[0]);
-        return false;
+        throw std::invalid_argument("error: bad arguments");
    }

    params.input = argv[arg_idx++];
    params.output = argv[arg_idx++];
-
-    return true;
 }

 static bool split_params_parse(int argc, const char ** argv, split_params & params) {
    bool result = true;
    try {
-        if (!split_params_parse_ex(argc, argv, params)) {
-            split_print_usage(argv[0]);
-            exit(EXIT_FAILURE);
-        }
+        split_params_parse_ex(argc, argv, params);
    }
    catch (const std::invalid_argument & ex) {
        fprintf(stderr, "%s\n", ex.what());
@@ -140,15 +181,11 @@ struct split_strategy {
    struct ggml_context * ctx_meta = NULL;
    const int n_tensors;

-    const int n_split;
-    int i_split = 0;
+    // one ctx_out per one output file
+    std::vector<struct gguf_context *> ctx_outs;

-    int i_tensor = 0;
-
-    std::vector<uint8_t> read_data;
-
-    struct gguf_context * ctx_out;
-    std::ofstream fout;
+    // temporary buffer for reading in tensor data
+    std::vector<uint8_t> read_buf;

    split_strategy(const split_params & params,
            std::ifstream & f_input,
@@ -158,79 +195,141 @@ struct split_strategy {
        f_input(f_input),
        ctx_gguf(ctx_gguf),
        ctx_meta(ctx_meta),
-        n_tensors(gguf_get_n_tensors(ctx_gguf)),
-        n_split(std::ceil(1. * n_tensors / params.n_split_tensors)) {
+        n_tensors(gguf_get_n_tensors(ctx_gguf)) {
+
+        // because we need to know list of tensors for each file in advance, we will build all the ctx_out for all output splits
+        int i_split = -1;
+        struct gguf_context * ctx_out = NULL;
+        auto new_ctx_out = [&]() {
+            i_split++;
+            if (ctx_out != NULL) {
+                if (gguf_get_n_tensors(ctx_out) == 0) {
+                    fprintf(stderr, "error: one of splits have 0 tensors. Maybe size or tensors limit is too small\n");
+                    exit(EXIT_FAILURE);
+                }
+                ctx_outs.push_back(ctx_out);
+            }
+            ctx_out = gguf_init_empty();
+            // Save all metadata in first split only
+            if (i_split == 0) {
+                gguf_set_kv(ctx_out, ctx_gguf);
+            }
+            gguf_set_val_u16(ctx_out, LLM_KV_SPLIT_NO, i_split);
+            gguf_set_val_u16(ctx_out, LLM_KV_SPLIT_COUNT, 0); // placeholder
+            gguf_set_val_i32(ctx_out, LLM_KV_SPLIT_TENSORS_COUNT, n_tensors);
+        };
+
+        // initialize ctx_out for the first split
+        new_ctx_out();
+
+        // process tensors one by one
+        size_t curr_tensors_size = 0; // current size by counting only tensors size (without metadata)
+        for (int i = 0; i < n_tensors; ++i) {
+            struct ggml_tensor * t = ggml_get_tensor(ctx_meta, gguf_get_tensor_name(ctx_gguf, i));
+            // calculate the "imaginary" size = the current size + next tensor size
+            size_t n_bytes = GGML_PAD(ggml_nbytes(t), GGUF_DEFAULT_ALIGNMENT);
+            size_t next_tensors_size = curr_tensors_size + n_bytes;
+            if (should_split(i, next_tensors_size)) {
+                new_ctx_out();
+                curr_tensors_size = n_bytes;
+            } else {
+                curr_tensors_size = next_tensors_size;
+            }
+            gguf_add_tensor(ctx_out, t);
        }

-    bool should_split() const {
-        return i_tensor < n_tensors && i_tensor % params.n_split_tensors == 0;
+        // push the last ctx_out
+        ctx_outs.push_back(ctx_out);
+
+        // set the correct n_split for all ctx_out
+        for (auto & ctx : ctx_outs) {
+            gguf_set_val_u16(ctx, LLM_KV_SPLIT_COUNT, ctx_outs.size());
+        }
    }

-    void split_start() {
-        ctx_out = gguf_init_empty();
-
-        // Save all metadata in first split only
-        if (i_split == 0) {
-            gguf_set_kv(ctx_out, ctx_gguf);
+    ~split_strategy() {
+        for (auto & ctx_out : ctx_outs) {
+            gguf_free(ctx_out);
        }
-        gguf_set_val_u16(ctx_out, LLM_KV_SPLIT_NO, i_split);
-        gguf_set_val_u16(ctx_out, LLM_KV_SPLIT_COUNT, n_split);
-        gguf_set_val_i32(ctx_out, LLM_KV_SPLIT_TENSORS_COUNT, n_tensors);
-
-        // populate the original tensors, so we get an initial metadata
-        for (int i = i_split * params.n_split_tensors; i < n_tensors && i < (i_split + 1) * params.n_split_tensors; ++i) {
-            struct ggml_tensor * meta = ggml_get_tensor(ctx_meta, gguf_get_tensor_name(ctx_gguf, i));
-            gguf_add_tensor(ctx_out, meta);
-        }
-
-        char split_path[PATH_MAX] = {0};
-        llama_split_path(split_path, sizeof(split_path), params.output.c_str(), i_split, n_split);
-
-        fprintf(stderr, "%s: %s ...", __func__, split_path);
-        fout = std::ofstream(split_path, std::ios::binary);
-        fout.exceptions(std::ofstream::failbit); // fail fast on write errors
-
-        auto meta_size = gguf_get_meta_size(ctx_out);
-
-        // placeholder for the meta data
-        ::zeros(fout, meta_size);
-
-        i_split++;
    }

-    void next_tensor() {
-        const char * t_name = gguf_get_tensor_name(ctx_gguf, i_tensor);
-        struct ggml_tensor * t = ggml_get_tensor(ctx_meta, t_name);
-        auto n_bytes = ggml_nbytes(t);
-
-        if (read_data.size() < n_bytes) {
-            read_data.resize(n_bytes);
+    bool should_split(int i_tensor, size_t next_size) {
+        if (params.n_bytes_split > 0) {
+            // split by max size per file
+            return next_size > params.n_bytes_split;
+        } else {
+            // split by number of tensors per file
+            return i_tensor > 0 && i_tensor < n_tensors && i_tensor % params.n_split_tensors == 0;
        }
-
-        auto offset = gguf_get_data_offset(ctx_gguf) + gguf_get_tensor_offset(ctx_gguf, i_tensor);
-        f_input.seekg(offset);
-        f_input.read((char *)read_data.data(), n_bytes);
-
-        t->data = read_data.data();
-
-        // write tensor data + padding
-        fout.write((const char *)t->data, n_bytes);
-        zeros(fout, GGML_PAD(n_bytes, GGUF_DEFAULT_ALIGNMENT) - n_bytes);
-
-        i_tensor++;
    }

-    void split_end() {
-        // go back to beginning of file and write the updated metadata
-        fout.seekp(0);
-        std::vector<uint8_t> data(gguf_get_meta_size(ctx_out));
-        gguf_get_meta_data(ctx_out, data.data());
-        fout.write((const char *)data.data(), data.size());
+    void print_info() {
+        printf("n_split: %ld\n", ctx_outs.size());
+        int i_split = 0;
+        for (auto & ctx_out : ctx_outs) {
+            // re-calculate the real gguf size for each split (= metadata size + total size of all tensors)
+            size_t total_size = gguf_get_meta_size(ctx_out);
+            for (int i = 0; i < gguf_get_n_tensors(ctx_out); ++i) {
+                struct ggml_tensor * t = ggml_get_tensor(ctx_meta, gguf_get_tensor_name(ctx_out, i));
+                total_size += ggml_nbytes(t);
+            }
+            total_size = total_size / 1024 / 1024; // convert to megabytes
+            printf("split %05d: n_tensors = %d, total_size = %ldM\n", i_split + 1, gguf_get_n_tensors(ctx_out), total_size);
+            i_split++;
+        }
+    }

-        fout.close();
-        gguf_free(ctx_out);
+    void write() {
+        int i_split = 0;
+        int n_split = ctx_outs.size();
+        for (auto & ctx_out : ctx_outs) {
+            // construct file path
+            char split_path[PATH_MAX] = {0};
+            llama_split_path(split_path, sizeof(split_path), params.output.c_str(), i_split, n_split);

-        fprintf(stderr, "\033[3Ddone\n");
+            // open the output file
+            printf("Writing file %s ... ", split_path);
+            fflush(stdout);
+            std::ofstream fout = std::ofstream(split_path, std::ios::binary);
+            fout.exceptions(std::ofstream::failbit); // fail fast on write errors
+
+            // write metadata
+            std::vector<uint8_t> data(gguf_get_meta_size(ctx_out));
+            gguf_get_meta_data(ctx_out, data.data());
+            fout.write((const char *)data.data(), data.size());
+
+            // write tensors
+            for (int i = 0; i < gguf_get_n_tensors(ctx_out); ++i) {
+                // read tensor meta and prepare buffer
+                const char * t_name = gguf_get_tensor_name(ctx_out, i);
+                struct ggml_tensor * t = ggml_get_tensor(ctx_meta, t_name);
+                auto n_bytes = ggml_nbytes(t);
+                read_buf.resize(n_bytes);
+
+                // calculate offset
+                auto i_tensor_in = gguf_find_tensor(ctx_gguf, t_name); // idx of tensor in the input file
+                auto offset = gguf_get_data_offset(ctx_gguf) + gguf_get_tensor_offset(ctx_gguf, i_tensor_in);
+
+                // copy tensor from input to output file
+                copy_file_to_file(f_input, fout, offset, n_bytes);
+                zeros(fout, GGML_PAD(n_bytes, GGUF_DEFAULT_ALIGNMENT) - n_bytes);
+            }
+
+            printf("done\n");
+            // close the file
+            fout.close();
+            i_split++;
+        }
+    }
+
+    void copy_file_to_file(std::ifstream & f_in, std::ofstream & f_out, const size_t in_offset, const size_t len) {
+        // TODO: detect OS and use copy_file_range() here for better performance
+        if (read_buf.size() < len) {
+            read_buf.resize(len);
+        }
+        f_in.seekg(in_offset);
+        f_in.read((char *)read_buf.data(), len);
+        f_out.write((const char *)read_buf.data(), len);
    }
 };

@@ -254,32 +353,22 @@ static void gguf_split(const split_params & split_params) {
        exit(EXIT_FAILURE);
    }

+    // prepare the strategy
    split_strategy strategy(split_params, f_input, ctx_gguf, ctx_meta);
+    int n_split = strategy.ctx_outs.size();
+    strategy.print_info();

-    char first_split_path[PATH_MAX] = {0};
-    llama_split_path(first_split_path, sizeof(first_split_path),
-                     split_params.output.c_str(), strategy.i_split, strategy.n_split);
-    fprintf(stderr, "%s: %s -> %s (%d tensors per file)\n",
-            __func__, split_params.input.c_str(),
-            first_split_path,
-            split_params.n_split_tensors);
-
-    strategy.split_start();
-
-    while (strategy.i_tensor < strategy.n_tensors) {
-        strategy.next_tensor();
-        if (strategy.should_split()) {
-            strategy.split_end();
-            strategy.split_start();
-        }
+    if (!split_params.dry_run) {
+        // write all output splits
+        strategy.write();
    }
-    strategy.split_end();

+    // done, clean up
    gguf_free(ctx_gguf);
    f_input.close();

    fprintf(stderr, "%s: %d gguf split written with a total of %d tensors.\n",
-            __func__, strategy.n_split, strategy.n_tensors);
+            __func__, n_split, strategy.n_tensors);
 }

 static void gguf_merge(const split_params & split_params) {
@@ -448,10 +537,6 @@ static void gguf_merge(const split_params & split_params) {
 }

 int main(int argc, const char ** argv) {
-    if (argc < 3) {
-        split_print_usage(argv[0]);
-    }
-
    split_params params;
    split_params_parse(argc, argv, params);

--- a/examples/llava/MobileVLM-README.md
+++ b/examples/llava/MobileVLM-README.md
@@ -6,7 +6,7 @@ for more information, please go to [Meituan-AutoML/MobileVLM](https://github.com

 The implementation is based on llava, and is compatible with llava and mobileVLM. The usage is basically same as llava.

-Notice: The overall process of model inference for both **MobileVLM** and **MobileVLM_V2** models is the same, but the process of model conversion  is a little different. Therefore, using MobileVLM as an example, the different conversion step will be shown.
+Notice: The overall process of model inference for both **MobileVLM** and **MobileVLM_V2** models is the same, but the process of model conversion is a little different. Therefore, using **MobileVLM-1.7B** as an example, the different conversion step will be shown.

 ## Usage
 Build with cmake or run `make llava-cli` to build it.
@@ -36,7 +36,7 @@ git clone https://huggingface.co/openai/clip-vit-large-patch14-336
 python ./examples/llava/llava-surgery.py -m path/to/MobileVLM-1.7B
 ```

-3. Use `convert-image-encoder-to-gguf.py` with `--projector-type ldp` (for **V2** the arg is `--projector-type ldpv2`) to convert the LLaVA image encoder to GGUF:
+3. Use `convert-image-encoder-to-gguf.py` with `--projector-type ldp` (for **V2** please use `--projector-type ldpv2`) to convert the LLaVA image encoder to GGUF:

 ```sh
 python ./examples/llava/convert-image-encoder-to-gguf \
@@ -78,7 +78,7 @@ cd examples/llava/android/build_64
 ### run on Android
 refer to `android/adb_run.sh`, modify resources' `name` and `path`

-## some result on Android with `Snapdragon 888` chip
+## Some result on Android with `Snapdragon 888` chip
 ### case 1
 **input**
 ```sh
@@ -109,7 +109,6 @@ llama_print_timings:       total time =   34731.93 ms
    --image /data/local/tmp/cat.jpeg \
    -p "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image>\nWhat is in the image? ASSISTANT:"
 ```
-
 **output**
 ```sh
 encode_image_with_clip: image encoded in 21149.51 ms by CLIP (  146.87 ms per image patch)
@@ -121,12 +120,82 @@ llama_print_timings:        eval time =    1279.03 ms /    18 runs   (   71.06 m
 llama_print_timings:       total time =   34570.79 ms
 ```

+
+## Some result on Android with `Snapdragon 778G` chip
+### MobileVLM-1.7B case
+#### llava-cli release-b2005
+**input**
+```sh
+/data/local/tmp/llava-cli \
+    -m /data/local/tmp/ggml-model-q4_k.gguf \
+    --mmproj /data/local/tmp/mmproj-model-f16.gguf \
+    -t 4 \
+    --image /data/local/tmp/many_llamas.jpeg \
+    -p "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image>\nWhat's that? ASSISTANT:"
+```
+**output**
+```sh
+encode_image_with_clip: image encoded in 18728.52 ms by CLIP (  130.06 ms per image patch)
+system_prompt: A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER:
+user_prompt: \nWhat's that? ASSISTANT:
+
+ A group of llamas are standing in a green pasture.
+
+llama_print_timings:        load time =   20357.33 ms
+llama_print_timings:      sample time =       2.96 ms /    14 runs   (    0.21 ms per token,  4734.53 tokens per second)
+llama_print_timings: prompt eval time =    8119.49 ms /   191 tokens (   42.51 ms per token,    23.52 tokens per second)
+llama_print_timings:        eval time =    1005.75 ms /    14 runs   (   71.84 ms per token,    13.92 tokens per second)
+llama_print_timings:       total time =   28038.34 ms /   205 tokens
+```
+#### llava-cli latest-version
+**input**
+
+Just the same as above.
+
+**output**(seems to be much slower)
+```sh
+encode_image_with_clip: image embedding created: 144 tokens
+
+encode_image_with_clip: image encoded in 288268.88 ms by CLIP ( 2001.87 ms per image patch)
+system_prompt: A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER:
+user_prompt: \nWhat's that? ASSISTANT:
+
+ It is a group of sheep standing together in a grass field.
+
+llama_print_timings:        load time =  818120.91 ms
+llama_print_timings:      sample time =       3.44 ms /    14 runs   (    0.25 ms per token,  4067.40 tokens per second)
+llama_print_timings: prompt eval time =  529274.69 ms /   191 tokens ( 2771.07 ms per token,     0.36 tokens per second)
+llama_print_timings:        eval time =   43894.02 ms /    13 runs   ( 3376.46 ms per token,     0.30 tokens per second)
+llama_print_timings:       total time =  865441.76 ms /   204 tokens
+```
+### MobileVLM_V2-1.7B case
+#### llava-cli release-2005b
+**input**
+
+Just the same as above.
+
+**output**
+```sh
+encode_image_with_clip: image encoded in 20609.61 ms by CLIP (  143.12 ms per image patch)
+system_prompt: A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER:
+user_prompt: \nWhat's that? ASSISTANT:
+
+ This image captures a lively scene of 20 llamas in motion on an expansive, grassy field. The llama is scattered across the landscape with some standing and others sitting down as if taking rest or observing their surroundings from different vantage points within this verdant setting.
+
+The background offers glimpses into a picturesque town nestled amidst hills under an overcast sky, adding depth to the scene while also emphasizing that distance between these llama and human-made structures like houses or roads in which they roam freely without any barriers around them. The image is framed by text at both right angles on white backgrounds against a contrasting blue backdrop with green foliage, further drawing attention to the llamas amidst their natural habitat while also inviting viewers into this picturesque landscape within town limits of Alta Llama
+
+llama_print_timings:        load time =   22406.77 ms
+llama_print_timings:      sample time =      49.26 ms /   186 runs   (    0.26 ms per token,  3776.27 tokens per second)
+llama_print_timings: prompt eval time =    9044.54 ms /   191 tokens (   47.35 ms per token,    21.12 tokens per second)
+llama_print_timings:        eval time =   14497.49 ms /   186 runs   (   77.94 ms per token,    12.83 tokens per second)
+llama_print_timings:       total time =   44411.01 ms /   377 tokens
+```
+
 ## Orin compile and run
 ### compile
 ```sh
 make LLAMA_CUDA=1 CUDA_DOCKER_ARCH=sm_87 LLAMA_CUDA_F16=1 -j 32
 ```
-
 ### run on Orin
 ### case 1
 **input**
@@ -175,8 +244,121 @@ llama_print_timings:        eval time =     166.65 ms /    11 runs   (   15.15 m
 llama_print_timings:       total time =    1365.47 ms /   243 tokens
 ```

-## Minor shortcomings
-The `n_patch` of output in `ldp` is 1/4 of the input. In order to implement quickly, we uniformly modified `clip_n_patches` function to a quarter. when counting the time consumption, the calculated time will be 4 times bigger than the real cost.
+## Running on Intel(R) Core(TM) i7-10750H
+### Operating system
+Ubuntu22.04
+### compile
+```sh
+make -j32
+```
+### MobileVLM-1.7B case
+**input**
+```sh
+-m /path/to/ggml-model-q4_k.gguf \
+    --mmproj /path/to/mmproj-model-f16.gguf \
+    --image /path/to/many_llamas.jpeg
+    -p "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image>\nWhat's that? ASSISTANT:" \
+```
+**output**
+```sh
+encode_image_with_clip: image embedding created: 144 tokens
+
+encode_image_with_clip: image encoded in  2730.94 ms by CLIP (   18.96 ms per image patch)
+system_prompt: A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER:
+user_prompt: \nWhat's that?ASSISTANT:
+
+ A group of llamas are walking together in a field.
+
+llama_print_timings:        load time =    5506.60 ms
+llama_print_timings:      sample time =       0.44 ms /    13 runs   (    0.03 ms per token, 29545.45 tokens per second)
+llama_print_timings: prompt eval time =    2031.58 ms /   190 tokens (   10.69 ms per token,    93.52 tokens per second)
+llama_print_timings:        eval time =     438.92 ms /    12 runs   (   36.58 ms per token,    27.34 tokens per second)
+llama_print_timings:       total time =    5990.25 ms /   202 tokens
+```
+
+### MobileVLM_V2-1.7B case
+**input**
+
+Just the same as above.
+
+**ouput**
+```sh
+encode_image_with_clip: image embedding created: 144 tokens
+
+encode_image_with_clip: image encoded in  3223.89 ms by CLIP (   22.39 ms per image patch)
+system_prompt: A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER:
+user_prompt: \nWhat's that?ASSISTANT:
+
+ The image captures a tranquil scene in a park, where a group of approximately 20 llamas are gathered. The llamas, a mix of white and black, are standing in a line, their black and white patterns contrasting with the lush green grass of the park. The lamas are arranged in a line, suggesting a social order.
+
+The park itself is lush and green, with trees dotting the landscape in the background. A sign reading "Llamas Tico  Ana" is also visible in the image, possibly indicating the location or the breed of the llamas. The image seems to be taken from a distance, providing a wide view of the scene and the surrounding environment.
+
+The llamas' positions relative to each other, the sign, and the trees create a harmonious composition. The image does not contain any discernible text. The overall scene is one of peace and natural beauty, with the llamas in their natural habitat, surrounded by the vibrant colors and lush greenery of the park.
+
+llama_print_timings:        load time =    6642.61 ms
+llama_print_timings:      sample time =       8.15 ms /   223 runs   (    0.04 ms per token, 27358.61 tokens per second)
+llama_print_timings: prompt eval time =    2475.07 ms /   190 tokens (   13.03 ms per token,    76.77 tokens per second)
+llama_print_timings:        eval time =    8760.60 ms /   222 runs   (   39.46 ms per token,    25.34 tokens per second)
+llama_print_timings:       total time =   15513.95 ms /   412 tokens
+```
+
+## Run on Intel(R) Core(TM) Ultra7 115H
+### operation system
+Windows11
+### comiple
+```sh
+make -j32
+```
+### MobileVLM-1.7B case
+**input**
+```sh
+-m /path/to/ggml-model-q4_k.gguf \
+    --mmproj /path/to/tmp/mmproj-model-f16.gguf \
+    -p "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image>\nWhat's that? ASSISTANT:" \
+```
+**output**
+```sh
+encode_image_with_clip: image encoded in  4902.81 ms by CLIP (   34.05 ms per image patch)
+system_prompt: A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER:
+user_prompt: \nWhat's that? ASSISTANT:
+
+ The image features a group of brown and white llamas standing in a grassy field.
+
+llama_print_timings:        load time =    7441.06 ms
+llama_print_timings:      sample time =       0.72 ms /    19 runs   (    0.04 ms per token, 26279.39 tokens per second)
+llama_print_timings: prompt eval time =    2090.71 ms /   191 tokens (   10.95 ms per token,    91.36 tokens per second)
+llama_print_timings:        eval time =     512.35 ms /    18 runs   (   28.46 ms per token,    35.13 tokens per second)
+llama_print_timings:       total time =    7987.23 ms /   209 tokens
+```
+
+### MobileVLM_V2-1.7B case
+**input**
+
+Just the same as above.
+
+**output**
+```sh
+encode_image_with_clip: image encoded in  4682.44 ms by CLIP (   32.52 ms per image patch)
+system_prompt: A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER:
+user_prompt: \nWhat's that? ASSISTANT:
+
+ This image captures a lively scene of a group of 14 llamas in a grassy field. The llamas, with their distinctive black and white coats, are standing and walking in a line, seemingly engaged in a social activity. One
+ of them, possibly the first in the line, has its back turned, perhaps observing something in the distance.
+
+The llama in the front of the line stands out due to its black and white coloring, which is quite unusual for llama patterns. The llama in the front also seems to be more aware of its surroundings, as it faces the camera, giving a sense of engagement with the viewer.
+
+The image is taken from the side of the llama, providing a clear view of the llama in the front and its companions. The lameness in the llama in
+ front is not visible, indicating that it might not be the main focus of the photo.
+
+The background of the image features a grassy field, with a fence and a tree visible in the distance. The tree appears to be bare, suggesting that it might be during a time of year when most trees are dormant or have shed their leaves.
+
+
+llama_print_timings:        load time =    7015.35 ms
+llama_print_timings:      sample time =      10.61 ms /   256 runs   (    0.04 ms per token, 24119.09 tokens per second)
+llama_print_timings: prompt eval time =    2052.45 ms /   191 tokens (   10.75 ms per token,    93.06 tokens per second)
+llama_print_timings:        eval time =    7259.43 ms /   255 runs   (   28.47 ms per token,    35.13 tokens per second)
+llama_print_timings:       total time =   14371.19 ms /   446 tokens
+```

 ## TODO

@@ -191,5 +373,5 @@ The `n_patch` of output in `ldp` is 1/4 of the input. In order to implement quic

 ## contributor
 ```sh
-zhangjidong05, yangyang260, huyiming03, chenxiaotao03
+zhangjidong05, yangyang260, huyiming03, chenxiaotao03, ZiangWu-77
 ```
--- a/examples/llava/clip.cpp
+++ b/examples/llava/clip.cpp
@@ -835,9 +835,10 @@ static ggml_cgraph * clip_image_build_graph(clip_ctx * ctx, const clip_image_f32
            mlp_2 = ggml_pool_2d(ctx0, mlp_2, GGML_OP_POOL_AVG, 2, 2, 2, 2, 0, 0);
            // weight ne = [3, 3, 2048, 1]
            struct ggml_tensor * peg_0 = ggml_conv_depthwise_2d(ctx0, model.mm_model_peg_0_w, mlp_2, 1, 1, 1, 1, 1, 1);
-            peg_0 = ggml_add(ctx0, peg_0, mlp_2);
            peg_0 = ggml_cont(ctx0, ggml_permute(ctx0, peg_0, 1, 2, 0, 3));
            peg_0 = ggml_add(ctx0, peg_0, model.mm_model_peg_0_b);
+            mlp_2 = ggml_cont(ctx0, ggml_permute(ctx0, mlp_2, 1, 2, 0, 3));
+            peg_0 = ggml_add(ctx0, peg_0, mlp_2);
            peg_0 = ggml_reshape_3d(ctx0, peg_0, peg_0->ne[0], peg_0->ne[1] * peg_0->ne[2], peg_0->ne[3]);
            embeddings = peg_0;
        }
@@ -1755,7 +1756,7 @@ int clip_n_patches(const struct clip_ctx * ctx) {

    int n_patches = (params.image_size / params.patch_size) * (params.image_size / params.patch_size);

-    if (ctx->proj_type == PROJECTOR_TYPE_LDP) {
+    if (ctx->proj_type == PROJECTOR_TYPE_LDP || ctx->proj_type == PROJECTOR_TYPE_LDPV2) {
        n_patches /= 4;
    }

--- a/ggml-alloc.c
+++ b/ggml-alloc.c
@@ -705,8 +705,13 @@ bool ggml_gallocr_reserve_n(ggml_gallocr_t galloc, struct ggml_cgraph * graph, c
        struct ggml_tensor * leaf = graph->leafs[i];
        struct hash_node * hn = ggml_gallocr_hash_get(galloc, leaf);
        galloc->leaf_allocs[i].buffer_id = hn->buffer_id;
-        galloc->leaf_allocs[i].leaf.offset = hn->offset;
-        galloc->leaf_allocs[i].leaf.size_max = ggml_backend_buft_get_alloc_size(galloc->bufts[hn->buffer_id], leaf);
+        if (leaf->view_src || leaf->data) {
+            galloc->leaf_allocs[i].leaf.offset = SIZE_MAX;
+            galloc->leaf_allocs[i].leaf.size_max = 0;
+        } else {
+            galloc->leaf_allocs[i].leaf.offset = hn->offset;
+            galloc->leaf_allocs[i].leaf.size_max = ggml_backend_buft_get_alloc_size(galloc->bufts[hn->buffer_id], leaf);
+        }
    }

    // reallocate buffers if needed
--- a/ggml-cuda/common.cuh
+++ b/ggml-cuda/common.cuh
@@ -1,7 +1,8 @@
 #pragma once

-#include "../ggml.h"
-#include "../ggml-cuda.h"
+#include "ggml.h"
+#include "ggml-cuda.h"
+
 #include <memory>

 #if defined(GGML_USE_HIPBLAS)
@@ -11,7 +12,7 @@
 #define GGML_COMMON_DECL_CUDA
 #define GGML_COMMON_IMPL_CUDA
 #endif
-#include "../ggml-common.h"
+#include "ggml-common.h"

 #include <cstdio>
 #include <array>
--- a/ggml-cuda/dmmv.cu
+++ b/ggml-cuda/dmmv.cu
@@ -2,14 +2,6 @@
 #include "dequantize.cuh"
 #include "convert.cuh"

-// dmmv = dequantize_mul_mat_vec
-#ifndef GGML_CUDA_DMMV_X
-#define GGML_CUDA_DMMV_X 32
-#endif
-#ifndef GGML_CUDA_MMV_Y
-#define GGML_CUDA_MMV_Y 1
-#endif
-
 #ifndef K_QUANTS_PER_ITERATION
 #define K_QUANTS_PER_ITERATION 2
 #else
--- a/ggml-cuda/dmmv.cuh
+++ b/ggml-cuda/dmmv.cuh
@@ -1,5 +1,16 @@
 #include "common.cuh"

+// dmmv = dequantize_mul_mat_vec
+
+// TODO: remove this?
+#ifndef GGML_CUDA_DMMV_X
+#define GGML_CUDA_DMMV_X 32
+#endif
+
+#ifndef GGML_CUDA_MMV_Y
+#define GGML_CUDA_MMV_Y 1
+#endif
+
 void ggml_cuda_op_dequantize_mul_mat_vec(
    ggml_backend_cuda_context & ctx,
    const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst, const char * src0_dd_i, const float * src1_ddf_i,
--- a/ggml-vulkan-shaders.hpp
+++ b/ggml-vulkan-shaders.hpp
--- a/ggml-vulkan.cpp
+++ b/ggml-vulkan.cpp
--- a/ggml-vulkan.h
+++ b/ggml-vulkan.h
@@ -11,17 +11,6 @@ extern "C" {
 #define GGML_VK_MAX_DEVICES 16

 GGML_API void ggml_vk_instance_init(void);
-GGML_API void ggml_vk_init_cpu_assist(void);
-
-GGML_API void ggml_vk_preallocate_buffers_graph_cpu_assist(struct ggml_tensor * node);
-GGML_API void ggml_vk_preallocate_buffers_cpu_assist(void);
-GGML_API void ggml_vk_build_graph_cpu_assist(struct ggml_tensor * node, bool last_node);
-GGML_API bool ggml_vk_compute_forward_cpu_assist(struct ggml_compute_params * params, struct ggml_tensor * tensor);
-#ifdef GGML_VULKAN_CHECK_RESULTS
-void ggml_vk_check_results_1_cpu_assist(struct ggml_compute_params * params, struct ggml_tensor * tensor);
-#endif
-GGML_API void ggml_vk_graph_cleanup_cpu_assist(void);
-GGML_API void ggml_vk_free_cpu_assist(void);

 // backend API
 GGML_API GGML_CALL ggml_backend_t ggml_backend_vk_init(size_t dev_num);
--- a/ggml.c
+++ b/ggml.c
@@ -278,8 +278,6 @@ inline static void * ggml_calloc(size_t num, size_t size) {
 #include <Accelerate/Accelerate.h>
 #if defined(GGML_USE_CLBLAST) // allow usage of CLBlast alongside Accelerate functions
 #include "ggml-opencl.h"
-#elif defined(GGML_USE_VULKAN)
-#include "ggml-vulkan.h"
 #endif
 #elif defined(GGML_USE_OPENBLAS)
 #if defined(GGML_BLAS_USE_MKL)
@@ -289,8 +287,6 @@ inline static void * ggml_calloc(size_t num, size_t size) {
 #endif
 #elif defined(GGML_USE_CLBLAST)
 #include "ggml-opencl.h"
-#elif defined(GGML_USE_VULKAN)
-#include "ggml-vulkan.h"
 #endif

 // floating point type used to accumulate sums
@@ -2717,8 +2713,6 @@ struct ggml_context * ggml_init(struct ggml_init_params params) {

 #if defined(GGML_USE_CLBLAST)
        ggml_cl_init();
-#elif defined(GGML_USE_VULKAN)
-        ggml_vk_init_cpu_assist();
 #endif

        ggml_setup_op_has_task_pass();
@@ -16128,20 +16122,6 @@ static void ggml_compute_forward(struct ggml_compute_params * params, struct ggm
        return;
    }

-#if defined(GGML_USE_VULKAN)
-    const bool skip_cpu = ggml_vk_compute_forward_cpu_assist(params, tensor);
-#ifdef GGML_VULKAN_CHECK_RESULTS
-    if (skip_cpu) {
-        ggml_vk_check_results_1_cpu_assist(params, tensor);
-    }
-#endif
-    if (skip_cpu) {
-        return;
-    }
-    GGML_ASSERT(tensor->src[0] == NULL || tensor->src[0]->backend == GGML_BACKEND_TYPE_CPU);
-    GGML_ASSERT(tensor->src[1] == NULL || tensor->src[1]->backend == GGML_BACKEND_TYPE_CPU);
-#endif // GGML_USE_VULKAN
-
    switch (tensor->op) {
        case GGML_OP_DUP:
            {
@@ -18617,17 +18597,6 @@ enum ggml_status ggml_graph_compute(struct ggml_cgraph * cgraph, struct ggml_cpl
        }
    }

-#ifdef GGML_USE_VULKAN
-    for (int i = 0; i < cgraph->n_nodes; i++) {
-        ggml_vk_preallocate_buffers_graph_cpu_assist(cgraph->nodes[i]);
-    }
-    ggml_vk_preallocate_buffers_cpu_assist();
-
-    for (int i = 0; i < cgraph->n_nodes; i++) {
-        ggml_vk_build_graph_cpu_assist(cgraph->nodes[i], i == cgraph->n_nodes - 1);
-    }
-#endif
-
    const int n_threads = cplan->n_threads;

    struct ggml_compute_state_shared state_shared = {
@@ -18684,10 +18653,6 @@ enum ggml_status ggml_graph_compute(struct ggml_cgraph * cgraph, struct ggml_cpl
        }
    }

-#ifdef GGML_USE_VULKAN
-    ggml_vk_graph_cleanup_cpu_assist();
-#endif
-
    // performance stats (graph)
    {
        int64_t perf_cycles_cur  = ggml_perf_cycles()  - perf_start_cycles;
--- a/ggml_vk_generate_shaders.py
+++ b/ggml_vk_generate_shaders.py
@@ -18,6 +18,12 @@ shader_int8_ext = """
 """

 # Type-specific defines
+shader_f32_defines = """
+#define QUANT_K 1
+#define QUANT_R 1
+
+#define A_TYPE float
+"""
 shader_f16_defines = """
 #define QUANT_K 1
 #define QUANT_R 1
@@ -157,8 +163,8 @@ struct block_q6_K
 """

 # Dequant functions
-shader_f16_dequant_func = """
-#define DEQUANT_FUNC vec2 v = vec2(data_a[ib + 0], data_a[ib + 1]);
+shader_float_dequant_func = """
+#define DEQUANT_FUNC vec2 v = vec2(ib, ib);  // data_a[ib], data_a[ib + 1]);
 """

 shader_q4_0_dequant_func = """
@@ -410,6 +416,133 @@ mulmat_load_q8_0 = """
            buf_a[buf_idx    ] = FLOAT_TYPE(v.x);
            buf_a[buf_idx + 1] = FLOAT_TYPE(v.y);"""

+
+mulmat_load_q2_K = """
+            const uint idx = pos_a + (loadc_a + l) * p.stride_a / LOAD_VEC_A + loadr_a;
+            const uint buf_idx = (loadc_a + l) * (BK+1) + loadr_a * LOAD_VEC_A;
+
+            const uint ib = idx / 128;                         // 2 values per idx
+            const uint iqs = idx % 128;                        // 0..127
+
+            const uint qsi = (iqs / 64) * 32 + (iqs % 16) * 2; // 0,2,4..30
+            const uint scalesi = iqs / 8;                      // 0..15
+            const uint qsshift = ((iqs % 64) / 16) * 2;        // 0,2,4,6
+
+            const uvec2 qs = uvec2(data_a[ib].qs[qsi], data_a[ib].qs[qsi + 1]);
+            const uint scales = data_a[ib].scales[scalesi];
+            const vec2 d = vec2(data_a[ib].d);
+
+            const vec2 v = d.x * float(scales & 0xF) * vec2((qs >> qsshift) & 3) - d.y * float(scales >> 4);
+
+            buf_a[buf_idx    ] = FLOAT_TYPE(v.x);
+            buf_a[buf_idx + 1] = FLOAT_TYPE(v.y);"""
+
+mulmat_load_q3_K = """
+            const uint idx = pos_a + (loadc_a + l) * p.stride_a / LOAD_VEC_A + loadr_a;
+            const uint buf_idx = (loadc_a + l) * (BK+1) + loadr_a * LOAD_VEC_A;
+
+            const uint ib = idx / 128;                   // 2 values per idx
+            const uint iqs = idx % 128;                  // 0..127
+
+            const uint n = iqs / 64;                     // 0,1
+            const uint qsi = n * 32 + (iqs % 16) * 2;    // 0,2,4..62
+            const uint hmi =          (iqs % 16) * 2;    // 0,2,4..30
+            const uint j = (iqs % 64) / 4;               // 0..3
+            const uint is = iqs / 8;                     // 0..15
+            const uint halfsplit = ((iqs % 64) / 16);    // 0,1,2,3
+            const uint qsshift = halfsplit * 2;          // 0,2,4,6
+            const uint m = 1 << (4 * n + halfsplit);     // 1,2,4,8,16,32,64,128
+
+            const int8_t us = int8_t(is <  4 ? (data_a[ib].scales[is-0] & 0xF) | (((data_a[ib].scales[is+8] >> 0) & 3) << 4) :
+                                    is <  8 ? (data_a[ib].scales[is-0] & 0xF) | (((data_a[ib].scales[is+4] >> 2) & 3) << 4) :
+                                    is < 12 ? (data_a[ib].scales[is-8] >>  4) | (((data_a[ib].scales[is+0] >> 4) & 3) << 4) :
+                                            (data_a[ib].scales[is-8] >>  4) | (((data_a[ib].scales[is-4] >> 6) & 3) << 4));
+            const float dl = float(data_a[ib].d) * float(us - 32);
+
+            buf_a[buf_idx    ] = FLOAT_TYPE(dl * float(int8_t((data_a[ib].qs[qsi    ] >> qsshift) & 3) - (((data_a[ib].hmask[hmi    ] & m) != 0) ? 0 : 4)));
+            buf_a[buf_idx + 1] = FLOAT_TYPE(dl * float(int8_t((data_a[ib].qs[qsi + 1] >> qsshift) & 3) - (((data_a[ib].hmask[hmi + 1] & m) != 0) ? 0 : 4)));"""
+
+mulmat_load_q4_K = """
+            const uint idx = pos_a + (loadc_a + l) * p.stride_a / LOAD_VEC_A + loadr_a;
+            const uint buf_idx = (loadc_a + l) * (BK+1) + loadr_a * LOAD_VEC_A;
+
+            const uint ib = idx / 128;                 // 2 values per idx
+            const uint iqs = idx % 128;                // 0..127
+
+            const uint n = iqs / 32;                   // 0,1,2,3
+            const uint b = (iqs % 32) / 16;            // 0,1
+            const uint is = 2 * n + b;                 // 0..7
+            const uint qsi = n * 32 + (iqs % 16) * 2;  // 0,2,4..126
+
+            const vec2 loadd = vec2(data_a[ib].d);
+
+            uint8_t sc;
+            uint8_t mbyte;
+            if (is < 4) {
+                sc    = uint8_t(data_a[ib].scales[is    ] & 63);
+                mbyte = uint8_t(data_a[ib].scales[is + 4] & 63);
+            } else {
+                sc    = uint8_t((data_a[ib].scales[is + 4] & 0xF) | ((data_a[ib].scales[is - 4] >> 6) << 4));
+                mbyte = uint8_t((data_a[ib].scales[is + 4] >>  4) | ((data_a[ib].scales[is    ] >> 6) << 4));
+            }
+            const float d = loadd.x * sc;
+            const float m = loadd.y * mbyte;
+
+            buf_a[buf_idx    ] = FLOAT_TYPE(d * float((data_a[ib].qs[qsi    ] >> (b * 4)) & 0xF) - m);
+            buf_a[buf_idx + 1] = FLOAT_TYPE(d * float((data_a[ib].qs[qsi + 1] >> (b * 4)) & 0xF) - m);"""
+
+mulmat_load_q5_K = """
+            const uint idx = pos_a + (loadc_a + l) * p.stride_a / LOAD_VEC_A + loadr_a;
+            const uint buf_idx = (loadc_a + l) * (BK+1) + loadr_a * LOAD_VEC_A;
+
+            const uint ib = idx / 128;                 // 2 values per idx
+            const uint iqs = idx % 128;                // 0..127
+
+            const uint n = iqs / 32;                   // 0,1,2,3
+            const uint b = (iqs % 32) / 16;            // 0,1
+            const uint is = 2 * n + b;                 // 0..7
+            const uint qsi = n * 32 + (iqs % 16) * 2;  // 0,2,4..126
+            const uint qhi = (iqs % 16) * 2;           // 0,2,4..30
+
+            const uint8_t hm = uint8_t(1 << (iqs / 16));
+
+            const vec2 loadd = vec2(data_a[ib].d);
+
+            uint8_t sc;
+            uint8_t mbyte;
+            if (is < 4) {
+                sc    = uint8_t(data_a[ib].scales[is    ] & 63);
+                mbyte = uint8_t(data_a[ib].scales[is + 4] & 63);
+            } else {
+                sc    = uint8_t((data_a[ib].scales[is + 4] & 0xF) | ((data_a[ib].scales[is - 4] >> 6) << 4));
+                mbyte = uint8_t((data_a[ib].scales[is + 4] >>  4) | ((data_a[ib].scales[is    ] >> 6) << 4));
+            }
+            const float d = loadd.x * sc;
+            const float m = loadd.y * mbyte;
+
+            buf_a[buf_idx    ] = FLOAT_TYPE(d * (float((data_a[ib].qs[qsi    ] >> (b * 4)) & 0xF) + float((data_a[ib].qh[qhi    ] & hm) != 0 ? 16 : 0)) - m);
+            buf_a[buf_idx + 1] = FLOAT_TYPE(d * (float((data_a[ib].qs[qsi + 1] >> (b * 4)) & 0xF) + float((data_a[ib].qh[qhi + 1] & hm) != 0 ? 16 : 0)) - m);"""
+
+mulmat_load_q6_K = """
+            const uint idx = pos_a + (loadc_a + l) * p.stride_a / LOAD_VEC_A + loadr_a;
+            const uint buf_idx = (loadc_a + l) * (BK+1) + loadr_a * LOAD_VEC_A;
+
+            const uint ib = idx / 128;                  // 2 values per idx
+            const uint iqs = idx % 128;                 // 0..127
+
+            const uint n = iqs / 64;                    // 0,1
+            const uint b = (iqs % 64) / 32;             // 0,1
+            const uint is_b = (iqs % 16) / 8;           // 0,1
+            const uint qhshift = ((iqs % 64) / 16) * 2; // 0,2,4,6
+            const uint is = 8 * n + qhshift + is_b;     // 0..15
+            const uint qsi = n * 64 + (iqs % 32) * 2;   // 0,2,4..126
+            const uint qhi = n * 32 + (iqs % 16) * 2;   // 0,2,4..62
+
+            const float dscale = float(data_a[ib].d) * float(data_a[ib].scales[is]);
+
+            buf_a[buf_idx    ] = FLOAT_TYPE(dscale * float(int8_t(((data_a[ib].ql[qsi    ] >> (b * 4)) & 0xF) | (((data_a[ib].qh[qhi    ] >> qhshift) & 3) << 4)) - 32));
+            buf_a[buf_idx + 1] = FLOAT_TYPE(dscale * float(int8_t(((data_a[ib].ql[qsi + 1] >> (b * 4)) & 0xF) | (((data_a[ib].qh[qhi + 1] >> qhshift) & 3) << 4)) - 32));"""
+
 mulmat_body2 = """
        }
        [[unroll]] for (uint l = 0; l < BN; l += loadstride_b) {
@@ -1611,8 +1744,9 @@ layout (push_constant) uniform parameter
    uint ne10; uint ne11; uint ne12; uint ne13; uint nb10; uint nb11; uint nb12; uint nb13;
    uint d_offset;
    float param1; float param2;
-} p;
+} p;"""

+generic_unary_op_funcs = """
 layout(local_size_x = 512, local_size_y = 1, local_size_z = 1) in;

 layout (binding = 0) readonly buffer A {A_TYPE data_a[];};
@@ -1636,14 +1770,17 @@ uint dst_idx(uint idx) {
    const uint i11 = (idx - i13_offset - i12_offset) / p.ne10;
    const uint i10 = idx - i13_offset - i12_offset - i11*p.ne10;
    return i13*p.nb13 + i12*p.nb12 + i11*p.nb11 + i10*p.nb10;
-}
+}"""

+generic_unary_op_main = """
 void main() {
    if (gl_GlobalInvocationID.x >= p.ne) {
        return;
    }
 """

+generic_unary_op_combined = f"{generic_unary_op_head}\n{generic_unary_op_funcs}\n{generic_unary_op_main}"
+
 generic_binary_op_head = """#version 450

 #extension GL_EXT_shader_16bit_storage : require
@@ -1655,13 +1792,14 @@ layout (push_constant) uniform parameter
    uint ne10; uint ne11; uint ne12; uint ne13; uint nb10; uint nb11; uint nb12; uint nb13;
    uint ne20; uint ne21; uint ne22; uint ne23; uint nb20; uint nb21; uint nb22; uint nb23;
    uint d_offset;
-    uint param1; uint param2;
-} p;
+    float param1; float param2;
+} p;"""

+generic_binary_op_funcs = """
 layout(local_size_x = 512, local_size_y = 1, local_size_z = 1) in;

 layout (binding = 0) readonly buffer A {A_TYPE data_a[];};
-layout (binding = 1) readonly buffer B {A_TYPE data_b[];};
+layout (binding = 1) readonly buffer B {B_TYPE data_b[];};
 layout (binding = 2) writeonly buffer D {D_TYPE data_d[];};

 uint src0_idx(uint idx) {
@@ -1693,14 +1831,17 @@ uint dst_idx(uint idx) {
    const uint i21 = (idx - i23_offset - i22_offset) / p.ne20;
    const uint i20 = idx - i23_offset - i22_offset - i21*p.ne20;
    return i23*p.nb23 + i22*p.nb22 + i21*p.nb21 + i20*p.nb20;
-}
+}"""

+generic_binary_op_main = """
 void main() {
    if (gl_GlobalInvocationID.x >= p.ne) {
        return;
    }
 """

+generic_binary_op_combined = f"{generic_binary_op_head}\n{generic_binary_op_funcs}\n{generic_binary_op_main}"
+
 # MUL F32
 mul_body = """
    data_d[p.d_offset + dst_idx(gl_GlobalInvocationID.x)] = D_TYPE(FLOAT_TYPE(data_a[src0_idx(gl_GlobalInvocationID.x)]) * FLOAT_TYPE(data_b[src1_idx(gl_GlobalInvocationID.x)]));
@@ -1745,39 +1886,55 @@ cpy_f16_f16_end = """
 """

 # GET_ROWS
-get_rows_body = """
-#extension GL_EXT_control_flow_attributes : enable
-#extension GL_EXT_shader_8bit_storage : require
-
-layout(local_size_x = 512, local_size_y = 1, local_size_z = 1) in;
-
-layout (binding = 0) readonly buffer X {A_TYPE data_a[];};
-layout (binding = 1) readonly buffer Y {int data_b[];};
-layout (binding = 2) writeonly buffer D {D_TYPE dst[];};
-
+get_rows_float_body = """
 void main() {
-    const uint col = int(gl_GlobalInvocationID.x) * 2;
-    const uint row = int(gl_GlobalInvocationID.y);
+    const uint i00 = gl_GlobalInvocationID.x;
+    const uint i10 = gl_GlobalInvocationID.y;
+    const uint i11 = (gl_GlobalInvocationID.z)/p.ne12;
+    const uint i12 = (gl_GlobalInvocationID.z)%p.ne12;

-    if (col >= p.KY) {
+    if (i00 >= p.ne00) {
        return;
    }

-    const uint r = uint(data_b[row]);
+    const uint i01 = data_b[i10*p.nb10 + i11*p.nb11 + i12*p.nb12];

-    // copy data_a[r*p.KY + col] to dst[row*p.KX + col]
-    const uint xi = r*p.KY + col;
-    const uint di = row*p.KY + col;
+    const uint a_offset = i01*p.nb01 + i11*p.nb02 + i12*p.nb03;
+    const uint d_offset = i10*p.nb21 + i11*p.nb22 + i12*p.nb23;

-    const uint ib = xi/QUANT_K; // block index
-    const uint iqs = (xi%QUANT_K)/QUANT_R; // quant index
-    const uint iybs = di - di%QUANT_K; // y block start index
+#ifndef OPTIMIZATION_ERROR_WORKAROUND
+    data_d[d_offset + i00] = D_TYPE(data_a[a_offset + i00]);
+#else
+    data_d[d_offset + i00] = data_a[a_offset + i00];
+#endif
+}
+"""
+
+get_rows_body = """
+void main() {
+    const uint i00 = (gl_GlobalInvocationID.x)*2;
+    const uint i10 = gl_GlobalInvocationID.y;
+    const uint i11 = (gl_GlobalInvocationID.z)/p.ne12;
+    const uint i12 = (gl_GlobalInvocationID.z)%p.ne12;
+
+    if (i00 >= p.ne00) {
+        return;
+    }
+
+    const uint i01 = data_b[i10*p.nb10 + i11*p.nb11 + i12*p.nb12];
+
+    const uint a_offset = i01*p.nb01 + i11*p.nb02 + i12*p.nb03;
+    const uint d_offset = i10*p.nb21 + i11*p.nb22 + i12*p.nb23;
+
+    const uint ib = a_offset + i00/QUANT_K; // block index
+    const uint iqs = (i00%QUANT_K)/QUANT_R; // quant index
+    const uint iybs = i00 - i00%QUANT_K; // dst block start index
    const uint y_offset = QUANT_R == 1 ? 1 : QUANT_K/2;

    DEQUANT_FUNC

-    dst[iybs + iqs + 0]        = D_TYPE(v.x);
-    dst[iybs + iqs + y_offset] = D_TYPE(v.y);
+    data_d[d_offset + iybs + iqs           ] = D_TYPE(v.x);
+    data_d[d_offset + iybs + iqs + y_offset] = D_TYPE(v.y);
 }
 """

@@ -2418,6 +2575,31 @@ async def main():
        tasks.append(string_to_spv("matmul_q8_0_f32", "".join(stream), {"LOAD_VEC_A": 2, "A_TYPE": "block_q8_0", "B_TYPE": "float", "D_TYPE": "float"}, fp16))
        tasks.append(string_to_spv("matmul_q8_0_f32_aligned", "".join(stream), {"LOAD_VEC_A": 2, "LOAD_VEC_B": load_vec, "A_TYPE": "block_q8_0", "B_TYPE": vec_type, "D_TYPE": "float"}, fp16))

+        stream.clear()
+        stream.extend((mulmat_head, shader_int8_ext, shader_float_type, shader_q2_K_defines, mulmat_body1, mulmat_load_q2_K, mulmat_body2))
+        tasks.append(string_to_spv("matmul_q2_k_f32", "".join(stream), {"LOAD_VEC_A": 2, "A_TYPE": "block_q2_K", "B_TYPE": "float", "D_TYPE": "float"}, fp16))
+        tasks.append(string_to_spv("matmul_q2_k_f32_aligned", "".join(stream), {"LOAD_VEC_A": 2, "LOAD_VEC_B": load_vec, "A_TYPE": "block_q2_K", "B_TYPE": vec_type, "D_TYPE": "float"}, fp16))
+
+        stream.clear()
+        stream.extend((mulmat_head, shader_int8_ext, shader_float_type, shader_q3_K_defines, mulmat_body1, mulmat_load_q3_K, mulmat_body2))
+        tasks.append(string_to_spv("matmul_q3_k_f32", "".join(stream), {"LOAD_VEC_A": 2, "A_TYPE": "block_q3_K", "B_TYPE": "float", "D_TYPE": "float"}, fp16))
+        tasks.append(string_to_spv("matmul_q3_k_f32_aligned", "".join(stream), {"LOAD_VEC_A": 2, "LOAD_VEC_B": load_vec, "A_TYPE": "block_q3_K", "B_TYPE": vec_type, "D_TYPE": "float"}, fp16))
+
+        stream.clear()
+        stream.extend((mulmat_head, shader_int8_ext, shader_float_type, shader_q4_K_defines, mulmat_body1, mulmat_load_q4_K, mulmat_body2))
+        tasks.append(string_to_spv("matmul_q4_k_f32", "".join(stream), {"LOAD_VEC_A": 2, "A_TYPE": "block_q4_K", "B_TYPE": "float", "D_TYPE": "float"}, fp16))
+        tasks.append(string_to_spv("matmul_q4_k_f32_aligned", "".join(stream), {"LOAD_VEC_A": 2, "LOAD_VEC_B": load_vec, "A_TYPE": "block_q4_K", "B_TYPE": vec_type, "D_TYPE": "float"}, fp16))
+
+        stream.clear()
+        stream.extend((mulmat_head, shader_int8_ext, shader_float_type, shader_q5_K_defines, mulmat_body1, mulmat_load_q5_K, mulmat_body2))
+        tasks.append(string_to_spv("matmul_q5_k_f32", "".join(stream), {"LOAD_VEC_A": 2, "A_TYPE": "block_q5_K", "B_TYPE": "float", "D_TYPE": "float"}, fp16))
+        tasks.append(string_to_spv("matmul_q5_k_f32_aligned", "".join(stream), {"LOAD_VEC_A": 2, "LOAD_VEC_B": load_vec, "A_TYPE": "block_q5_K", "B_TYPE": vec_type, "D_TYPE": "float"}, fp16))
+
+        stream.clear()
+        stream.extend((mulmat_head, shader_int8_ext, shader_float_type, shader_q6_K_defines, mulmat_body1, mulmat_load_q6_K, mulmat_body2))
+        tasks.append(string_to_spv("matmul_q6_k_f32", "".join(stream), {"LOAD_VEC_A": 2, "A_TYPE": "block_q6_K", "B_TYPE": "float", "D_TYPE": "float"}, fp16))
+        tasks.append(string_to_spv("matmul_q6_k_f32_aligned", "".join(stream), {"LOAD_VEC_A": 2, "LOAD_VEC_B": load_vec, "A_TYPE": "block_q6_K", "B_TYPE": vec_type, "D_TYPE": "float"}, fp16))
+
    # Shaders where precision is needed, so no fp16 version

    # mul mat vec
@@ -2426,7 +2608,7 @@ async def main():
        stream.extend((mul_mat_vec_head, shader_int8_ext, shader_f32))

        if i == GGML_TYPE_F16:
-            stream.extend((shader_f16_defines, shader_f16_dequant_func, mul_mat_vec_body))
+            stream.extend((shader_f16_defines, shader_float_dequant_func, mul_mat_vec_body))
        elif i == GGML_TYPE_Q4_0:
            stream.extend((shader_q4_0_defines, shader_q4_0_dequant_func, mul_mat_vec_body))
        elif i == GGML_TYPE_Q4_1:
@@ -2488,25 +2670,32 @@ async def main():
    # get_rows
    for i in range(0, VK_NUM_TYPES):
        stream.clear()
-        stream.extend((generic_head, shader_int8_ext, shader_f32))
+        stream.extend((generic_binary_op_head, shader_int8_ext, shader_f32))
+        optimization_workaround = False

-        if i == GGML_TYPE_F16:
-            stream.extend((shader_f16_defines,  shader_f16_dequant_func,  get_rows_body))
+        if i == GGML_TYPE_F32:
+            stream.extend((shader_f32_defines, generic_binary_op_funcs, get_rows_float_body))
+        elif i == GGML_TYPE_F16:
+            stream.extend((shader_f16_defines, generic_binary_op_funcs, get_rows_float_body))
+            optimization_workaround = True
        elif i == GGML_TYPE_Q4_0:
-            stream.extend((shader_q4_0_defines, shader_q4_0_dequant_func, get_rows_body))
+            stream.extend((shader_q4_0_defines, shader_q4_0_dequant_func, generic_binary_op_funcs, get_rows_body))
        elif i == GGML_TYPE_Q4_1:
-            stream.extend((shader_q4_1_defines, shader_q4_1_dequant_func, get_rows_body))
+            stream.extend((shader_q4_1_defines, shader_q4_1_dequant_func, generic_binary_op_funcs, get_rows_body))
        elif i == GGML_TYPE_Q5_0:
-            stream.extend((shader_q5_0_defines, shader_q5_0_dequant_func, get_rows_body))
+            stream.extend((shader_q5_0_defines, shader_q5_0_dequant_func, generic_binary_op_funcs, get_rows_body))
        elif i == GGML_TYPE_Q5_1:
-            stream.extend((shader_q5_1_defines, shader_q5_1_dequant_func, get_rows_body))
+            stream.extend((shader_q5_1_defines, shader_q5_1_dequant_func, generic_binary_op_funcs, get_rows_body))
        elif i == GGML_TYPE_Q8_0:
-            stream.extend((shader_q8_0_defines, shader_q8_0_dequant_func, get_rows_body))
+            stream.extend((shader_q8_0_defines, shader_q8_0_dequant_func, generic_binary_op_funcs, get_rows_body))
        else:
            continue

-        tasks.append(string_to_spv(f"get_rows_{type_names[i]}", "".join(stream), {"B_TYPE": "float", "D_TYPE": "float16_t"}))
-        tasks.append(string_to_spv(f"get_rows_{type_names[i]}_f32", "".join(stream), {"B_TYPE": "float", "D_TYPE": "float"}))
+        if optimization_workaround:
+            tasks.append(string_to_spv(f"get_rows_{type_names[i]}", "".join(stream), {"B_TYPE": "int", "D_TYPE": "float16_t", "OPTIMIZATION_ERROR_WORKAROUND": "1"}))
+        else:
+            tasks.append(string_to_spv(f"get_rows_{type_names[i]}", "".join(stream), {"B_TYPE": "int", "D_TYPE": "float16_t"}))
+        tasks.append(string_to_spv(f"get_rows_{type_names[i]}_f32", "".join(stream), {"B_TYPE": "int", "D_TYPE": "float"}))

    tasks.append(string_to_spv("mul_mat_vec_p021_f16_f32", mul_mat_p021_src, {"A_TYPE": "float16_t", "B_TYPE": "float", "D_TYPE": "float"}))
    tasks.append(string_to_spv("mul_mat_vec_nc_f16_f32", mul_mat_nc_src, {"A_TYPE": "float16_t", "B_TYPE": "float", "D_TYPE": "float"}))
@@ -2515,20 +2704,20 @@ async def main():
    tasks.append(string_to_spv("norm_f32", f"{generic_head}\n{shader_f32}\n{norm_body}", {"A_TYPE": "float", "D_TYPE": "float"}))
    tasks.append(string_to_spv("rms_norm_f32", f"{generic_head}\n{shader_f32}\n{rms_norm_body}", {"A_TYPE": "float", "D_TYPE": "float"}))

-    tasks.append(string_to_spv("cpy_f32_f32", f"{generic_unary_op_head}\n{cpy_end}", {"A_TYPE": "float", "D_TYPE": "float"}))
-    tasks.append(string_to_spv("cpy_f32_f16", f"{generic_unary_op_head}\n{cpy_end}", {"A_TYPE": "float", "D_TYPE": "float16_t"}))
-    tasks.append(string_to_spv("cpy_f16_f16", f"{generic_unary_op_head}\n{cpy_f16_f16_end}", {"A_TYPE": "float16_t", "D_TYPE": "float16_t"}))
+    tasks.append(string_to_spv("cpy_f32_f32", f"{generic_unary_op_combined}\n{cpy_end}", {"A_TYPE": "float", "D_TYPE": "float"}))
+    tasks.append(string_to_spv("cpy_f32_f16", f"{generic_unary_op_combined}\n{cpy_end}", {"A_TYPE": "float", "D_TYPE": "float16_t"}))
+    tasks.append(string_to_spv("cpy_f16_f16", f"{generic_unary_op_combined}\n{cpy_f16_f16_end}", {"A_TYPE": "float16_t", "D_TYPE": "float16_t"}))

-    tasks.append(string_to_spv("add_f32", f"{generic_binary_op_head}\n{add_body}", {"A_TYPE": "float", "B_TYPE": "float", "D_TYPE": "float", "FLOAT_TYPE": "float"}))
+    tasks.append(string_to_spv("add_f32", f"{generic_binary_op_combined}\n{add_body}", {"A_TYPE": "float", "B_TYPE": "float", "D_TYPE": "float", "FLOAT_TYPE": "float"}))

    tasks.append(string_to_spv("split_k_reduce", mulmat_split_k_reduce_src, {}))
-    tasks.append(string_to_spv("mul_f32", f"{generic_binary_op_head}\n{mul_body}", {"A_TYPE": "float", "B_TYPE": "float", "D_TYPE": "float", "FLOAT_TYPE": "float"}))
+    tasks.append(string_to_spv("mul_f32", f"{generic_binary_op_combined}\n{mul_body}", {"A_TYPE": "float", "B_TYPE": "float", "D_TYPE": "float", "FLOAT_TYPE": "float"}))

-    tasks.append(string_to_spv("scale_f32", f"{generic_unary_op_head}\n{scale_body}", {"A_TYPE": "float", "D_TYPE": "float", "FLOAT_TYPE": "float"}))
+    tasks.append(string_to_spv("scale_f32", f"{generic_unary_op_combined}\n{scale_body}", {"A_TYPE": "float", "D_TYPE": "float", "FLOAT_TYPE": "float"}))

-    tasks.append(string_to_spv("sqr_f32", f"{generic_unary_op_head}\n{sqr_body}", {"A_TYPE": "float", "D_TYPE": "float", "FLOAT_TYPE": "float"}))
+    tasks.append(string_to_spv("sqr_f32", f"{generic_unary_op_combined}\n{sqr_body}", {"A_TYPE": "float", "D_TYPE": "float", "FLOAT_TYPE": "float"}))

-    tasks.append(string_to_spv("clamp_f32", f"{generic_unary_op_head}\n{clamp_body}", {"A_TYPE": "float", "D_TYPE": "float", "FLOAT_TYPE": "float"}))
+    tasks.append(string_to_spv("clamp_f32", f"{generic_unary_op_combined}\n{clamp_body}", {"A_TYPE": "float", "D_TYPE": "float", "FLOAT_TYPE": "float"}))

    tasks.append(string_to_spv("gelu_f32", f"{generic_head}\n{shader_f32}\n{gelu_body}", {"A_TYPE": "float", "D_TYPE": "float"}))
    tasks.append(string_to_spv("silu_f32", f"{generic_head}\n{shader_f32}\n{silu_body}", {"A_TYPE": "float", "D_TYPE": "float"}))
--- a/gguf-py/gguf/constants.py
+++ b/gguf-py/gguf/constants.py
@@ -123,6 +123,7 @@ class MODEL_ARCH(IntEnum):
    GEMMA      = auto()
    STARCODER2 = auto()
    MAMBA      = auto()
+    XVERSE     = auto()
    COMMAND_R  = auto()


@@ -191,6 +192,7 @@ MODEL_ARCH_NAMES: dict[MODEL_ARCH, str] = {
    MODEL_ARCH.GEMMA:          "gemma",
    MODEL_ARCH.STARCODER2:     "starcoder2",
    MODEL_ARCH.MAMBA:          "mamba",
+    MODEL_ARCH.XVERSE:         "xverse",
    MODEL_ARCH.COMMAND_R:      "command-r",
 }

@@ -606,6 +608,22 @@ MODEL_TENSORS: dict[MODEL_ARCH, list[MODEL_TENSOR]] = {
        MODEL_TENSOR.SSM_D,
        MODEL_TENSOR.SSM_OUT,
    ],
+    MODEL_ARCH.XVERSE: [
+        MODEL_TENSOR.TOKEN_EMBD,
+        MODEL_TENSOR.OUTPUT_NORM,
+        MODEL_TENSOR.OUTPUT,
+        MODEL_TENSOR.ROPE_FREQS,
+        MODEL_TENSOR.ATTN_NORM,
+        MODEL_TENSOR.ATTN_Q,
+        MODEL_TENSOR.ATTN_K,
+        MODEL_TENSOR.ATTN_V,
+        MODEL_TENSOR.ATTN_OUT,
+        MODEL_TENSOR.ATTN_ROT_EMBD,
+        MODEL_TENSOR.FFN_NORM,
+        MODEL_TENSOR.FFN_GATE,
+        MODEL_TENSOR.FFN_DOWN,
+        MODEL_TENSOR.FFN_UP,
+    ],
    MODEL_ARCH.COMMAND_R: [
        MODEL_TENSOR.TOKEN_EMBD,
        MODEL_TENSOR.OUTPUT_NORM,
@@ -650,6 +668,10 @@ MODEL_TENSOR_SKIP: dict[MODEL_ARCH, list[MODEL_TENSOR]] = {
        MODEL_TENSOR.ROPE_FREQS,
        MODEL_TENSOR.ATTN_ROT_EMBD,
    ],
+    MODEL_ARCH.XVERSE: [
+        MODEL_TENSOR.ROPE_FREQS,
+        MODEL_TENSOR.ATTN_ROT_EMBD,
+    ],
 }

 #
--- a/llama.cpp
+++ b/llama.cpp
@@ -218,6 +218,7 @@ enum llm_arch {
    LLM_ARCH_GEMMA,
    LLM_ARCH_STARCODER2,
    LLM_ARCH_MAMBA,
+    LLM_ARCH_XVERSE,
    LLM_ARCH_COMMAND_R,
    LLM_ARCH_UNKNOWN,
 };
@@ -249,6 +250,7 @@ static const std::map<llm_arch, const char *> LLM_ARCH_NAMES = {
    { LLM_ARCH_GEMMA,           "gemma"      },
    { LLM_ARCH_STARCODER2,      "starcoder2" },
    { LLM_ARCH_MAMBA,           "mamba"      },
+    { LLM_ARCH_XVERSE,          "xverse"     },
    { LLM_ARCH_COMMAND_R,       "command-r"  },
    { LLM_ARCH_UNKNOWN,         "(unknown)"  },
 };
@@ -878,6 +880,25 @@ static const std::map<llm_arch, std::map<llm_tensor, std::string>> LLM_TENSOR_NA
            { LLM_TENSOR_SSM_OUT,         "blk.%d.ssm_out" },
        },
    },
+    {
+        LLM_ARCH_XVERSE,
+        {
+            { LLM_TENSOR_TOKEN_EMBD,      "token_embd" },
+            { LLM_TENSOR_OUTPUT_NORM,     "output_norm" },
+            { LLM_TENSOR_OUTPUT,          "output" },
+            { LLM_TENSOR_ROPE_FREQS,      "rope_freqs" },
+            { LLM_TENSOR_ATTN_NORM,       "blk.%d.attn_norm" },
+            { LLM_TENSOR_ATTN_Q,          "blk.%d.attn_q" },
+            { LLM_TENSOR_ATTN_K,          "blk.%d.attn_k" },
+            { LLM_TENSOR_ATTN_V,          "blk.%d.attn_v" },
+            { LLM_TENSOR_ATTN_OUT,        "blk.%d.attn_output" },
+            { LLM_TENSOR_ATTN_ROT_EMBD,   "blk.%d.attn_rot_embd" },
+            { LLM_TENSOR_FFN_NORM,        "blk.%d.ffn_norm" },
+            { LLM_TENSOR_FFN_GATE,        "blk.%d.ffn_gate" },
+            { LLM_TENSOR_FFN_DOWN,        "blk.%d.ffn_down" },
+            { LLM_TENSOR_FFN_UP,          "blk.%d.ffn_up" },
+        },
+    },
    {
        LLM_ARCH_COMMAND_R,
        {
@@ -2100,10 +2121,6 @@ struct llama_context {
            ggml_backend_free(backend);
        }

-#ifdef GGML_USE_VULKAN
-        ggml_vk_free_cpu_assist();
-#endif
-
        ggml_backend_buffer_free(buf_output);
    }

@@ -3847,6 +3864,16 @@ static void llm_load_hparams(
                    default: model.type = e_model::MODEL_UNKNOWN;
                }
            } break;
+        case LLM_ARCH_XVERSE:
+            {
+                ml.get_key(LLM_KV_ATTENTION_LAYERNORM_RMS_EPS, hparams.f_norm_rms_eps);
+                switch (hparams.n_layer) {
+                    case 32: model.type = e_model::MODEL_7B; break;
+                    case 40: model.type = e_model::MODEL_13B; break;
+                    case 80: model.type = e_model::MODEL_65B; break;
+                    default: model.type = e_model::MODEL_UNKNOWN;
+                }
+            } break;
        case LLM_ARCH_COMMAND_R:
            {
                ml.get_key(LLM_KV_LOGIT_SCALE, hparams.f_logit_scale);
@@ -5200,6 +5227,28 @@ static bool llm_load_tensors(
                        layer.ssm_out = ml.create_tensor(ctx_split, tn(LLM_TENSOR_SSM_OUT, "weight", i), {d_inner, n_embd});
                    }
                } break;
+            case LLM_ARCH_XVERSE:
+                {
+                    model.tok_embd = ml.create_tensor(ctx_input, tn(LLM_TENSOR_TOKEN_EMBD, "weight"), {n_embd, n_vocab});
+                    {
+                        model.output_norm = ml.create_tensor(ctx_output,       tn(LLM_TENSOR_OUTPUT_NORM, "weight"), {n_embd});
+                        model.output      = ml.create_tensor(ctx_output_split, tn(LLM_TENSOR_OUTPUT,      "weight"), {n_embd, n_vocab});
+                    }
+                    for (int i = 0; i < n_layer; ++i) {
+                        ggml_context * ctx_layer = ctx_for_layer(i);
+                        ggml_context * ctx_split = ctx_for_layer_split(i);
+                        auto & layer = model.layers[i];
+                        layer.attn_norm = ml.create_tensor(ctx_layer, tn(LLM_TENSOR_ATTN_NORM, "weight", i), {n_embd});
+                        layer.wq = ml.create_tensor(ctx_split, tn(LLM_TENSOR_ATTN_Q,   "weight", i), {n_embd, n_embd});
+                        layer.wk = ml.create_tensor(ctx_split, tn(LLM_TENSOR_ATTN_K,   "weight", i), {n_embd, n_embd_gqa});
+                        layer.wv = ml.create_tensor(ctx_split, tn(LLM_TENSOR_ATTN_V,   "weight", i), {n_embd, n_embd_gqa});
+                        layer.wo = ml.create_tensor(ctx_split, tn(LLM_TENSOR_ATTN_OUT, "weight", i), {n_embd, n_embd});
+                        layer.ffn_norm = ml.create_tensor(ctx_layer, tn(LLM_TENSOR_FFN_NORM, "weight", i), {n_embd});
+                        layer.ffn_gate = ml.create_tensor(ctx_split, tn(LLM_TENSOR_FFN_GATE, "weight", i), {n_embd,   n_ff});
+                        layer.ffn_down = ml.create_tensor(ctx_split, tn(LLM_TENSOR_FFN_DOWN, "weight", i), {  n_ff, n_embd});
+                        layer.ffn_up   = ml.create_tensor(ctx_split, tn(LLM_TENSOR_FFN_UP,   "weight", i), {n_embd,   n_ff});
+                    }
+                } break;
            case LLM_ARCH_COMMAND_R:
                {
                    model.tok_embd = ml.create_tensor(ctx_input, tn(LLM_TENSOR_TOKEN_EMBD, "weight"), {n_embd, n_vocab});
@@ -5238,7 +5287,7 @@ static bool llm_load_tensors(

    ml.done_getting_tensors();

-    ml.init_mappings(true, &model.mlock_mmaps);
+    ml.init_mappings(true, use_mlock ? &model.mlock_mmaps : nullptr);
    model.mappings.reserve(ml.mappings.size());

    // create the backend buffers
@@ -5523,8 +5572,8 @@ static void llm_build_kv_store(
    GGML_ASSERT(kv.size == n_ctx);

    // compute the transposed [n_tokens, n_embd] V matrix
-    struct ggml_tensor * v_cur_t = ggml_transpose(ctx, ggml_reshape_2d(ctx, v_cur, n_embd_v_gqa, n_tokens));
-    //struct ggml_tensor * v_cur_t = ggml_transpose(ctx, v_cur); // TODO: reshape above is likely not needed
+    assert(v_cur->ne[0] == n_embd_v_gqa && v_cur->ne[1] == n_tokens);
+    struct ggml_tensor * v_cur_t = ggml_transpose(ctx, v_cur);
    cb(v_cur_t, "v_cur_t", il);

    struct ggml_tensor * k_cache_view = ggml_view_1d(ctx, kv.k_l[il], n_tokens*n_embd_k_gqa,
@@ -6411,6 +6460,111 @@ struct llm_build_context {
        return gf;
    }

+    struct ggml_cgraph * build_xverse() {
+        struct ggml_cgraph * gf = ggml_new_graph_custom(ctx0, LLAMA_MAX_NODES, false);
+
+        const int64_t n_embd_head = hparams.n_embd_head_v;
+        GGML_ASSERT(n_embd_head == hparams.n_embd_head_k);
+        GGML_ASSERT(n_embd_head == hparams.n_rot);
+
+        struct ggml_tensor * cur;
+        struct ggml_tensor * inpL;
+
+        inpL = llm_build_inp_embd(ctx0, lctx, hparams, batch, model.tok_embd, cb);
+
+        // inp_pos - contains the positions
+        struct ggml_tensor * inp_pos = build_inp_pos();
+
+        // KQ_mask (mask for 1 head, it will be broadcasted to all heads)
+        struct ggml_tensor * KQ_mask = build_inp_KQ_mask();
+
+        // positions of the tokens in the KV cache
+        struct ggml_tensor * KQ_pos = build_inp_KQ_pos();
+
+        for (int il = 0; il < n_layer; ++il) {
+            struct ggml_tensor * inpSA = inpL;
+
+            cur = llm_build_norm(ctx0, inpL, hparams,
+                    model.layers[il].attn_norm, NULL,
+                    LLM_NORM_RMS, cb, il);
+            cb(cur, "attn_norm", il);
+
+            // self-attention
+            {
+                struct ggml_tensor * Qcur = ggml_mul_mat(ctx0, model.layers[il].wq, cur);
+                cb(Qcur, "Qcur", il);
+
+                struct ggml_tensor * Kcur = ggml_mul_mat(ctx0, model.layers[il].wk, cur);
+                cb(Kcur, "Kcur", il);
+
+                struct ggml_tensor * Vcur = ggml_mul_mat(ctx0, model.layers[il].wv, cur);
+                cb(Vcur, "Vcur", il);
+
+                Qcur = ggml_rope_custom(
+                    ctx0, ggml_reshape_3d(ctx0, Qcur, n_embd_head, n_head, n_tokens), inp_pos,
+                    n_rot, rope_type, 0, n_orig_ctx, freq_base, freq_scale,
+                    ext_factor, attn_factor, beta_fast, beta_slow
+                );
+                cb(Qcur, "Qcur", il);
+
+                Kcur = ggml_rope_custom(
+                    ctx0, ggml_reshape_3d(ctx0, Kcur, n_embd_head, n_head_kv, n_tokens), inp_pos,
+                    n_rot, rope_type, 0, n_orig_ctx, freq_base, freq_scale,
+                    ext_factor, attn_factor, beta_fast, beta_slow
+                );
+                cb(Kcur, "Kcur", il);
+                cur = llm_build_kv(ctx0, model, hparams, kv_self, gf,
+                        model.layers[il].wo, NULL,
+                        Kcur, Vcur, Qcur, KQ_mask, KQ_pos, n_ctx, n_tokens, kv_head, n_kv, 1.0f/sqrtf(float(n_embd_head)), cb, il);
+            }
+
+            if (il == n_layer - 1) {
+                // skip computing output for unused tokens
+                struct ggml_tensor * inp_out_ids = build_inp_out_ids();
+                cur   = ggml_get_rows(ctx0,      cur, inp_out_ids);
+                inpSA = ggml_get_rows(ctx0, inpSA, inp_out_ids);
+            }
+
+            struct ggml_tensor * ffn_inp = ggml_add(ctx0, cur, inpSA);
+            cb(ffn_inp, "ffn_inp", il);
+
+            // feed-forward network
+            {
+                cur = llm_build_norm(ctx0, ffn_inp, hparams,
+                        model.layers[il].ffn_norm, NULL,
+                        LLM_NORM_RMS, cb, il);
+                cb(cur, "ffn_norm", il);
+
+                cur = llm_build_ffn(ctx0, cur,
+                        model.layers[il].ffn_up,   NULL,
+                        model.layers[il].ffn_gate, NULL,
+                        model.layers[il].ffn_down, NULL,
+                        NULL,
+                        LLM_FFN_SILU, LLM_FFN_PAR, cb, il);
+                cb(cur, "ffn_out", il);
+            }
+
+            cur = ggml_add(ctx0, cur, ffn_inp);
+            cb(cur, "l_out", il);
+
+            // input for next layer
+            inpL = cur;
+        }
+
+        cur = inpL;
+
+        cur = llm_build_norm(ctx0, cur, hparams, model.output_norm, NULL, LLM_NORM_RMS, cb, -1);
+        cb(cur, "result_norm", -1);
+
+        // lm_head
+        cur = ggml_mul_mat(ctx0, model.output, cur);
+        cb(cur, "result_output", -1);
+
+        ggml_build_forward_expand(gf, cur);
+
+        return gf;
+    }
+
    struct ggml_cgraph * build_falcon() {
        struct ggml_cgraph * gf = ggml_new_graph_custom(ctx0, LLAMA_MAX_NODES, false);

@@ -9152,8 +9306,9 @@ struct llm_build_context {
            if (il == n_layer - 1) {
                // skip computing output for unused tokens
                struct ggml_tensor * inp_out_ids = build_inp_out_ids();
-                cur  = ggml_get_rows(ctx0,  cur, inp_out_ids);
-                inpL = ggml_get_rows(ctx0, inpL, inp_out_ids);
+                cur     = ggml_get_rows(ctx0,     cur, inp_out_ids);
+                inpL    = ggml_get_rows(ctx0,    inpL, inp_out_ids);
+                ffn_inp = ggml_get_rows(ctx0, ffn_inp, inp_out_ids);
            }

            struct ggml_tensor * attn_out = cur;
@@ -9388,6 +9543,10 @@ static struct ggml_cgraph * llama_build_graph(
            {
                result = llm.build_mamba();
            } break;
+        case LLM_ARCH_XVERSE:
+            {
+                result = llm.build_xverse();
+            } break;
        case LLM_ARCH_COMMAND_R:
            {
                result = llm.build_command_r();
@@ -13968,7 +14127,20 @@ struct llama_context * llama_new_context_with_model(
            }
        }
 #elif defined(GGML_USE_VULKAN)
-        if (model->n_gpu_layers > 0) {
+        if (model->split_mode == LLAMA_SPLIT_MODE_ROW) {
+            LLAMA_LOG_ERROR("%s: Row split not supported. Failed to initialize Vulkan backend\n", __func__);
+            llama_free(ctx);
+            return nullptr;
+        }
+        if (model->split_mode == LLAMA_SPLIT_MODE_NONE) {
+            ggml_backend_t backend = ggml_backend_vk_init(0);
+            if (backend == nullptr) {
+                LLAMA_LOG_ERROR("%s: failed to initialize Vulkan backend\n", __func__);
+                llama_free(ctx);
+                return nullptr;
+            }
+            ctx->backends.push_back(backend);
+        } else {
            for (int device = 0; device < ggml_backend_vk_get_device_count(); ++device) {
                ggml_backend_t backend = ggml_backend_vk_init(device);
                if (backend == nullptr) {
@@ -14187,6 +14359,7 @@ enum llama_rope_type llama_rope_type(const struct llama_model * model) {
        case LLM_ARCH_ORION:
        case LLM_ARCH_INTERNLM2:
        case LLM_ARCH_MINICPM:
+        case LLM_ARCH_XVERSE:
        case LLM_ARCH_COMMAND_R:
            return LLAMA_ROPE_TYPE_NORM;

--- a/llama.h
+++ b/llama.h
@@ -60,9 +60,9 @@ extern "C" {

    enum llama_vocab_type {
        LLAMA_VOCAB_TYPE_NONE = 0, // For models without vocab
-        LLAMA_VOCAB_TYPE_SPM  = 1, // SentencePiece
-        LLAMA_VOCAB_TYPE_BPE  = 2, // Byte Pair Encoding
-        LLAMA_VOCAB_TYPE_WPM  = 3, // WordPiece
+        LLAMA_VOCAB_TYPE_SPM  = 1, // LLaMA tokenizer based on byte-level BPE with byte fallback
+        LLAMA_VOCAB_TYPE_BPE  = 2, // GPT-2 tokenizer based on byte-level BPE
+        LLAMA_VOCAB_TYPE_WPM  = 3, // BERT tokenizer based on WordPiece
    };

    // note: these values should be synchronized with ggml_rope
--- a/scripts/gen-authors.sh
+++ b/scripts/gen-authors.sh
@@ -0,0 +1,9 @@
+#!/bin/bash
+
+printf "# date: $(date)\n" > AUTHORS
+printf "# this file is auto-generated by scripts/gen-authors.sh\n\n" >> AUTHORS
+
+git log --format='%an <%ae>' --reverse --date=short master | awk '!seen[$0]++' | sort >> AUTHORS
+
+# if necessary, update your name here. for example: jdoe -> John Doe
+sed -i '' 's/^jdoe/John Doe/g' AUTHORS
--- a/scripts/sync-ggml-am.sh
+++ b/scripts/sync-ggml-am.sh
@@ -65,6 +65,8 @@ while read c; do
        tests/test-quantize-fns.cpp \
        tests/test-quantize-perf.cpp \
        tests/test-backend-ops.cpp \
+        LICENSE \
+        scripts/gen-authors.sh \
        >> $SRC_LLAMA/ggml-src.patch
 done < $SRC_LLAMA/ggml-commits

@@ -95,6 +97,7 @@ if [ -f $SRC_LLAMA/ggml-src.patch ]; then
    # src/ggml-backend-impl.h     -> ggml-backend-impl.h
    # src/ggml-backend.c          -> ggml-backend.c
    # src/ggml-common.h           -> ggml-common.h
+    # src/ggml-cuda/*             -> ggml-cuda/
    # src/ggml-cuda.cu            -> ggml-cuda.cu
    # src/ggml-cuda.h             -> ggml-cuda.h
    # src/ggml-impl.h             -> ggml-impl.h
@@ -121,6 +124,9 @@ if [ -f $SRC_LLAMA/ggml-src.patch ]; then
    # tests/test-quantize-fns.cpp  -> tests/test-quantize-fns.cpp
    # tests/test-quantize-perf.cpp -> tests/test-quantize-perf.cpp
    # tests/test-backend-ops.cpp   -> tests/test-backend-ops.cpp
+    #
+    # LICENSE                      -> LICENSE
+    # scripts/gen-authors.sh       -> scripts/gen-authors.sh

    cat ggml-src.patch | sed \
        -e 's/src\/ggml\.c/ggml.c/g' \
@@ -128,6 +134,7 @@ if [ -f $SRC_LLAMA/ggml-src.patch ]; then
        -e 's/src\/ggml-backend-impl\.h/ggml-backend-impl.h/g' \
        -e 's/src\/ggml-backend\.c/ggml-backend.c/g' \
        -e 's/src\/ggml-common\.h/ggml-common.h/g' \
+        -e 's/src\/ggml-cuda\//ggml-cuda\//g' \
        -e 's/src\/ggml-cuda\.cu/ggml-cuda.cu/g' \
        -e 's/src\/ggml-cuda\.h/ggml-cuda.h/g' \
        -e 's/src\/ggml-impl\.h/ggml-impl.h/g' \
@@ -153,6 +160,8 @@ if [ -f $SRC_LLAMA/ggml-src.patch ]; then
        -e 's/tests\/test-quantize-fns\.cpp/tests\/test-quantize-fns.cpp/g' \
        -e 's/tests\/test-quantize-perf\.cpp/tests\/test-quantize-perf.cpp/g' \
        -e 's/tests\/test-backend-ops\.cpp/tests\/test-backend-ops.cpp/g' \
+        -e 's/LICENSE/LICENSE/g' \
+        -e 's/scripts\/gen-authors\.sh/scripts\/gen-authors.sh/g' \
        > ggml-src.patch.tmp
    mv ggml-src.patch.tmp ggml-src.patch

--- a/scripts/sync-ggml.sh
+++ b/scripts/sync-ggml.sh
@@ -5,6 +5,7 @@ cp -rpv ../ggml/src/ggml-alloc.c            ./ggml-alloc.c
 cp -rpv ../ggml/src/ggml-backend-impl.h     ./ggml-backend-impl.h
 cp -rpv ../ggml/src/ggml-backend.c          ./ggml-backend.c
 cp -rpv ../ggml/src/ggml-common.h           ./ggml-common.h
+cp -rpv ../ggml/src/ggml-cuda/*             ./ggml-cuda/
 cp -rpv ../ggml/src/ggml-cuda.cu            ./ggml-cuda.cu
 cp -rpv ../ggml/src/ggml-cuda.h             ./ggml-cuda.h
 cp -rpv ../ggml/src/ggml-impl.h             ./ggml-impl.h
@@ -30,3 +31,6 @@ cp -rpv ../ggml/include/ggml/ggml-backend.h ./ggml-backend.h
 cp -rpv ../ggml/tests/test-opt.cpp         ./tests/test-opt.cpp
 cp -rpv ../ggml/tests/test-grad0.cpp       ./tests/test-grad0.cpp
 cp -rpv ../ggml/tests/test-backend-ops.cpp ./tests/test-backend-ops.cpp
+
+cp -rpv ../LICENSE                         ./LICENSE
+cp -rpv ../ggml/scripts/gen-authors.sh     ./scripts/gen-authors.sh
Author	SHA1	Message	Date
Georgi Gerganov	072e0a4d3b	scipts : add LICENSE and gen-authors.sh to sync	2024-04-09 09:19:33 +03:00
Georgi Gerganov	0e0d4e821f	authors : update	2024-04-09 09:17:25 +03:00
Georgi Gerganov	805d705032	license : add AUTHORS	2024-03-31 10:37:39 +03:00
Pierrick Hymbert	37e7854c10	ci: bench: fix Resource not accessible by integration on PR event (#6393 )	2024-03-30 12:36:07 +02:00
Mohammadreza Hendiani	c342d070c6	Fedora build update (#6388 ) * fixed deprecated address * fixed deprecated address * fixed deprecated address * Added 'Apache-2.0' SPDX license identifier due to 'kompute.cc' submodule licensing. Explanation of licensing method: https://docs.fedoraproject.org/en-US/legal/spdx/#_and_expressions * Added 'Apache-2.0' SPDX license identifier due to 'kompute.cc' submodule licensing. Explanation of licensing method: https://docs.fedoraproject.org/en-US/legal/spdx/#_and_expressions * Added 'Apache-2.0' SPDX license identifier due to 'kompute.cc' submodule licensing. Explanation of licensing method: https://docs.fedoraproject.org/en-US/legal/spdx/#_and_expressions * reverted back to only the MIT license	2024-03-29 22:59:56 +01:00
Xuan Son Nguyen	f7fc5f6c6f	split: allow --split-max-size option (#6343 ) * split by max size * clean up arg parse * split: ok * add dry run option * error on 0 tensors * be positive * remove next_metadata_size	2024-03-29 22:34:44 +01:00
0cc4m	ba0c7c70ab	Vulkan k-quant mmq and ggml-backend offload functionality (#6155 ) * Fix Vulkan no kv offload incoherence * Add k-quant mul mat mat shaders * Rework working buffer allocation, reduces vram use noticeably Clean up cpu assist code, replaced with ggml-backend offload function * Default to all dedicated GPUs * Add fallback for integrated GPUs if no dedicated GPUs are found * Add debug info which device is allocating memory * Fix Intel dequant issue Fix validation issue * Fix Vulkan GGML_OP_GET_ROWS implementation * Clean up merge artifacts * Remove Vulkan warning	2024-03-29 17:29:21 +01:00
Georgi Gerganov	d48ccf3ad4	sync : ggml (#6351 ) * sync : ggml ggml-ci * cuda : move GGML_CUDA_DMMV constants to dmmv.cuh --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-03-29 17:45:46 +02:00
hxer7963	069574775c	[Model] Add support for xverse (#6301 ) * Support xverse model convert to gguf format. * 1. Convert xverse models to gguf; 2. Add LLM_ARCH_XVERSE inference in llama.cpp; 3. Add xverse item in Supported models in README.md; * * gguf-py: remove redundant logs * llama: remove the init_mapping_prefetch custom parameter * llama.cpp: Include the changes from #6122 to exclude the unused outputs of the last layers. * - Fix format issues - Remove duplicate set kqv_out to llm_build_kv * Update llama.cpp --------- Co-authored-by: willhe <willhe@xverse.cn> Co-authored-by: willhe <hexin@xverse.cn>	2024-03-29 14:37:03 +01:00
Georgi Gerganov	cfde806eb9	ci : fix BGE wget (#6383 ) ggml-ci	2024-03-29 14:34:28 +02:00
zhouwg	b910287954	readme : add project (#6356 ) * readme: add Android UI binding * Update README.md	2024-03-29 09:33:46 +02:00
Matt Clayton	8093987090	cmake : add explicit metal version options (#6370 ) * cmake: add explicit metal version options * Update CMakeLists.txt --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-03-29 09:27:42 +02:00
Daniel Bevenius	057400a3fd	llama : remove redundant reshape in build_kv_store (#6369 ) * llama: remove redundant reshape in build_kv_store This commit removes the reshape of the V matrix in the build_kv_store. The motivation for this is that V matrix has the shape: ```console (gdb) p v_cur $46 = {type = GGML_TYPE_F32, backend = GGML_BACKEND_TYPE_CPU, buffer = 0x0, ne = {4096, 512, 1, 1}, nb = {4, 16384, 8388608, 8388608}, op = GGML_OP_MUL_MAT, op_params = { 0 <repeats 16 times>}, flags = 0, grad = 0x0, src = {0xb496b0, 0x7ffef1c40950, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, perf_runs = 0, perf_cycles = 0, perf_time_us = 0, view_src = 0x0, view_offs = 0, data = 0x0, name = "Vcur-0", '\000' <repeats 57 times>, extra = 0x0, padding = "\000\000\000\000\000\000\000"} ``` And after reshaping this tensor we get: ```console gdb) p ggml_reshape_2d(ctx, v_cur, n_embd_v_gqa, n_tokens) $44 = {type = GGML_TYPE_F32, backend = GGML_BACKEND_TYPE_CPU, buffer = 0x0, ne = {4096, 512, 1, 1}, nb = {4, 16384, 8388608, 8388608}, op = GGML_OP_RESHAPE, op_params = { 0 <repeats 16 times>}, flags = 0, grad = 0x0, src = {0x7ffef1c40e00, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, perf_runs = 0, perf_cycles = 0, perf_time_us = 0, view_src = 0x7ffef1c40e00, view_offs = 0, data = 0x0, name = "Vcur-0 (reshaped)", '\000' <repeats 46 times>, extra = 0x0, padding = "\000\000\000\000\000\000\000"} ``` I noticed that the `src` and `view_src` fields are different but that the dimensions are the same. From the code comment it seems like the reshape call is not needed and perhaps the above can motivate the removal of the reshape call. Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com> * llama : add assert --------- Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-03-29 09:23:22 +02:00
Pedro Cuenca	b75c38166c	convert : allow conversion of Mistral HF models (#6144 ) * Allow conversion of Mistral HF models * Homogenize Llama, Mistral, Mixtral under the same entry. * Fix tokenizer, permute tensors * Use sentencepiece tokenizer, or fall back to hfft. * convert-hf : small fix for mypy * convert-hf : fix duplicated block_count * convert-hf : add vocab size to metadata --------- Co-authored-by: Jared Van Bortel <jared@nomic.ai>	2024-03-29 09:15:00 +02:00
Georgi Gerganov	bfe7dafc9c	readme : add notice for UI list	2024-03-28 22:56:03 +02:00
Ouadie EL FAROUKI	5106ef482c	[SYCL] Revisited & updated SYCL build documentation (#6141 ) * Revisited & updated SYCL build documentation * removed outdated comment * Addressed PR comments * Trimed white spaces * added new end line	2024-03-28 16:01:47 +00:00
Jared Van Bortel	be55134a53	convert : refactor vocab selection logic (#6355 )	2024-03-28 11:44:36 -04:00
Ziang Wu	66ba560256	llava : fix MobileVLM (#6364 ) * fix empty bug * Update MobileVLM-README.md added more results on devices * Update MobileVLM-README.md * Update MobileVLM-README.md * Update MobileVLM-README.md * Update MobileVLM-README.md * Update MobileVLM-README.md * Update MobileVLM-README.md * Update examples/llava/MobileVLM-README.md Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update MobileVLM-README.md remove gguf links --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-03-28 16:33:10 +02:00
compilade	0308f5e3d7	llama : fix command-r inference when omitting outputs (#6367 )	2024-03-28 14:05:54 +02:00
Pierrick Hymbert	28cb9a09c4	ci: bench: fix master not schedule, fix commit status failed on external repo (#6365 )	2024-03-28 11:27:56 +01:00