
llama.cpp

llama.cpp is also supported as an LMQL inference backend. This allows the use of models packaged as .gguf files, which run efficiently in CPU-only and mixed CPU/GPU environments using the llama.cpp C++ implementation.

Prerequisites

Before using llama.cpp models, make sure you have installed the llama.cpp Python bindings via pip install llama-cpp-python in the same environment as LMQL. You also need either the sentencepiece or the transformers package installed for tokenization. For GPU-enabled llama.cpp inference, install llama-cpp-python with the appropriate build flags, as described in its README.md.

Using llama.cpp Models

Just like Transformers models, you can load llama.cpp models either locally or via a long-lived lmql serve-model inference server.

Model Server

To start a llama.cpp model server, use the following command:

bash
lmql serve-model llama.cpp:<PATH TO WEIGHTS>.gguf

This launches an LMTP inference endpoint on localhost:8080, which can be accessed from LMQL via a corresponding lmql.model(...) object.

Using the llama.cpp endpoint

To access a served llama.cpp model, you can use an lmql.model(...) object with the following client-side configuration:

lmql.model("llama.cpp:<PATH TO WEIGHTS>.gguf", tokenizer="<tokenizer>")

Model Path: The client-side lmql.model(...) identifier must always match the exact GGUF location passed to lmql serve-model on the server side, even if that path does not exist on the client machine. In this context, it is merely used as a unique identifier for the model.

Tokenizer: When tokenizer=... is omitted, LMQL defaults to the transformers-based tokenizer for huggyllama/llama-7b. This works for Llama and Llama-based fine-tuned models, but must be adapted for other model families. To find a matching tokenizer for your concrete .gguf file, look up the corresponding transformers model on the Hugging Face model hub. Alternatively, you can use sentencepiece as the tokenization backend; for this, specify the client-side path to a corresponding tokenizer.model file.
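
For illustration, a minimal client-side query against a served model might look like the sketch below. The model path and the prompt are assumptions for the example; the path must match whatever was passed to lmql serve-model, and the tokenizer is the default described above.

python
import lmql

# assumes the server was started with, e.g.:
#   lmql serve-model llama.cpp:/models/llama-2-7b.Q4_K_M.gguf
# the path below must match that server-side path exactly (hypothetical path)
m = lmql.model(
    "llama.cpp:/models/llama-2-7b.Q4_K_M.gguf",
    tokenizer="huggyllama/llama-7b",  # default Llama tokenizer
)

@lmql.query(model=m)
def capital():
    '''lmql
    "Q: What is the capital of France? A: [ANSWER]" where len(TOKENS(ANSWER)) < 16
    '''

print(capital())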

Running Without a Model Server

To load a llama.cpp model directly as part of the Python process that executes your query program, use the local: prefix, followed by the path to the .gguf file:

lmql.model("local:llama.cpp:<PATH TO WEIGHTS>.gguf", tokenizer="<tokenizer>")

Again, you can omit the tokenizer=... argument to use the default huggyllama/llama-7b tokenizer. Otherwise, specify a tokenizer as described above.
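
As a sketch, loading the model in-process could look as follows; the path is a placeholder, and the sentencepiece variant mentioned in the comment assumes you have a local tokenizer.model file.

python
import lmql

# "local:" loads the .gguf file in the current Python process (hypothetical path);
# alternatively, pass tokenizer="/models/tokenizer.model" to use a local
# sentencepiece tokenizer instead of the transformers-based default
m = lmql.model(
    "local:llama.cpp:/models/llama-2-7b.Q4_K_M.gguf",
    tokenizer="huggyllama/llama-7b",
)

@lmql.query(model=m)
def summarize():
    '''lmql
    "Summarize llama.cpp in one sentence: [SUMMARY]" where STOPS_AT(SUMMARY, ".")
    '''

print(summarize())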

Configuring the Llama(...) instance

Any additional parameters passed to lmql serve-model (or, when running locally, to lmql.model(...)) are forwarded to the Llama(...) constructor.

For example, to configure the Llama(...) instance to use an n_ctx value of 1024, run:

bash
lmql serve-model llama.cpp:<PATH TO WEIGHTS>.gguf --n_ctx 1024

Or, when running locally, you can use lmql.model("local:llama.cpp:<PATH TO WEIGHTS>.gguf", n_ctx=1024).
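
As a sketch of local configuration, several Llama(...) constructor arguments can be forwarded in one call; n_gpu_layers is a llama-cpp-python option shown here purely as an illustrative assumption, and the path is hypothetical.

python
import lmql

# keyword arguments other than tokenizer are forwarded to Llama(...);
# n_ctx sets the context window, n_gpu_layers offloads layers to the GPU
# (requires a GPU-enabled llama-cpp-python build)
m = lmql.model(
    "local:llama.cpp:/models/llama-2-7b.Q4_K_M.gguf",  # hypothetical path
    tokenizer="huggyllama/llama-7b",
    n_ctx=1024,
    n_gpu_layers=32,
)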