Local Models / Transformers

LMQL relies on a two-process architecture: The inference process (long-running) loads the model and provides an inference API, and the interpreter process (short-lived) executes your LMQL program.

This architecture is advantageous for locally-hosted models, as the model loading time can be quite long or the required GPU hardware might not even be available on the client machine.

This chapter first discusses how to use the two-process inference API, and then presents In-Process Model Loading as an alternative that avoids the need for a separate server process while retaining the same architecture.

Inference Architecture

Prerequisites Before using a local model, make sure you have installed LMQL via pip install lmql[hf]. This ensures that the dependencies for running local models are installed. This requirement also applies to llama.cpp, as LMQL still relies on HuggingFace transformers for tokenization.

Then, to start an LMQL inference server, e.g. for the gpt2-medium model, you can run the following command:

bash
lmql serve-model gpt2-medium --cuda

--cuda will load the model on the GPU, if available. If multiple GPUs are available, the model will be distributed across all GPUs. To run with CPU inference, omit the --cuda flag. If you only want to use a specific GPU, you can specify the CUDA_VISIBLE_DEVICES environment variable, e.g. CUDA_VISIBLE_DEVICES=0 lmql serve-model gpt2-medium.

By default, this exposes an LMQL/LMTP inference API on port 8080. When serving a model remotely, make sure to tunnel/forward the port to your client machine. Now, when executing an LMQL query in the playground or via the CLI, you can simply specify gpt2-medium as the model name, and the runtime will automatically connect to the model server running on port 8080 to obtain model-generated text.
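
For example, the following query (runnable in the playground or via the CLI) references the served model by name. This is a minimal sketch, assuming the server started above is reachable on localhost:8080:

python
# minimal query against the served model; the runtime connects to the
# LMTP server on localhost:8080 to generate text for the [WHO] variable
argmax "Hello[WHO]" from "gpt2-medium"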

Configuration

Endpoint and Port By default, models will be served via port 8080. To change this, you can specify the port via the --port option of the lmql serve-model command. On the client side, to connect to a model server running on a different port, you can specify the port when constructing an lmql.model object:

python
lmql.model("gpt2", endpoint="localhost:9999")
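
For instance, to run a query against a server listening on port 9999, the model object can be used directly in the from clause (a sketch, assuming a server was started with --port 9999):

python
# query a model served on a non-default port
argmax "Hello[WHO]" from lmql.model("gpt2", endpoint="localhost:9999")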

Model Configuration To load a model with custom quantization preferences or other Transformers arguments, you can specify additional arguments when running the lmql serve-model command. For this, you can provide arbitrary arguments that will be passed directly to the underlying AutoModelForCausalLM.from_pretrained(...) function, as documented in the Transformers documentation.

For example, to set trust_remote_code to True with the from_pretrained function, run the following:

bash
lmql serve-model gpt2-medium --cuda --port 9999 --trust_remote_code True

Alternatively, you can also serve a model directly from within a Python environment by running lmql.serve("gpt2-medium", cuda=True, port=9999, trust_remote_code=True). Just as with the CLI, standard Transformers arguments are passed through to the AutoModelForCausalLM.from_pretrained function.
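
For example, the following snippet starts a server from Python, equivalent to the CLI invocation above:

python
import lmql

# start an LMTP inference server on port 9999; trust_remote_code is
# forwarded to the underlying from_pretrained call
lmql.serve("gpt2-medium", cuda=True, port=9999, trust_remote_code=True)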

In-Process Models

If you would like to load the model in-process, without having to run a separate lmql serve-model command, you can do so by instantiating an lmql.model object with the local: prefix in the model name. For example, the following query loads the gpt2 model in-process:

python
argmax "Hello[WHO]" from lmql.model("local:gpt2")

Note, however, that the model will be loaded again on each restart of the LMQL process, which can incur significant overhead.

If you want more control over model loading and configuration, you can pass additional arguments to lmql.model(...), as demonstrated below.

python
lmql.model("local:gpt2", cuda=True)
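
Such a configured model object can then be used directly in a query, e.g. as in the following sketch:

python
# run a query against the in-process model; cuda=True loads it on the GPU
argmax "Hello[WHO]" from lmql.model("local:gpt2", cuda=True)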

Quantization

Quantization reduces the precision of model parameters to shrink model size and boost inference speed with minimal accuracy loss. LMQL supports two quantization formats: AWQ (using AutoAWQ) and GPTQ (using AutoGPTQ).

AutoAWQ

AWQ minimizes quantization error by protecting crucial weights, promoting model efficiency without sacrificing accuracy. It's ideal for scenarios requiring both compression and acceleration of LLMs.

Install AutoAWQ following the repo instructions. To use AWQ-quantized models, run:

bash
lmql serve-model TheBloke/Mistral-7B-OpenOrca-AWQ --loader awq

AutoGPTQ

AutoGPTQ reduces model size while retaining performance by lowering the precision of model weights to 4 or 3 bits. It's suitable for efficient deployment and operation of LLMs on consumer-grade hardware.

Install AutoGPTQ following the repo instructions. To use GPTQ-quantized models, run:

bash
lmql serve-model TheBloke/Arithmo-Mistral-7B-GPTQ --loader gptq