

LMQL supports various decoding algorithms, which are used to generate text from the token distribution of a language model. The decoding algorithm can be specified right at the beginning of a query, e.g. using a decoder keyword like argmax.

All supported decoding algorithms are model-agnostic and can be used with any LMQL-supported inference backend. For more information on the supported inference backends, see the Models chapter.

Setting The Decoding Algorithm

Depending on context, LMQL offers two ways to specify the decoding algorithm to use.

Decoder Configuration as part of the query: The first option is to specify the decoding algorithm and its parameters as part of the query itself. This is particularly useful if your choice of decoder is relevant to the concrete program you are writing.

# use beam search with beam width 2 for
# the entire program
beam(n=2)

# uses beam search to generate RESPONSE
"This is a query with a specified decoder: [RESPONSE]"

Decoding algorithms are always specified for the entire query program, and cannot change within a program. To use different decoders for different parts of your program, you have to split your program into multiple queries.

Specifying the Decoding Algorithm Externally: The second option is to specify the decoding algorithm and parameters externally, i.e. separately from the actual program code:

import lmql

@lmql.query(model="openai/text-davinci-003", decoder="sample", temperature=1.8)
def tell_a_joke():
    '''lmql
    """A list of good dad jokes. A indicates the punchline:
    Q:[JOKE]
    A:[PUNCHLINE]""" where STOPS_AT(JOKE, "?") and STOPS_AT(PUNCHLINE, "\n")
    '''

tell_a_joke() # uses the decoder specified in @lmql.query(...)
tell_a_joke(decoder="beam", n=2) # uses a beam search decoder with n=2

This is only possible when using LMQL from a Python context.

Decoding Algorithms

In general, the very first keyword of an LMQL query specifies the decoding algorithm to use. The following decoder keywords are available:

argmax

The argmax decoder is the simplest decoder available in LMQL. It greedily selects the most likely token at each step of the decoding process. It has no additional parameters. Since argmax decoding is deterministic, one can only generate a single sequence at a time.
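To make the greedy selection step concrete, here is a minimal sketch over a toy next-token table. The names TOY_MODEL and argmax_decode are hypothetical and exist only for this illustration; this is not LMQL's implementation, which lives in dclib.

```python
# Toy next-token table standing in for a language model (hypothetical data).
TOY_MODEL = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.3, "</s>": 0.2},
    "a": {"dog": 0.7, "cat": 0.3},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}


def argmax_decode(model, start="<s>", end="</s>", max_len=10):
    """Greedily append the most likely next token until `end` or `max_len`."""
    tokens = [start]
    while tokens[-1] != end and len(tokens) < max_len:
        dist = model[tokens[-1]]
        # Pick the token with the highest probability at this step.
        tokens.append(max(dist, key=dist.get))
    return tokens


print(argmax_decode(TOY_MODEL))  # ['<s>', 'the', 'cat', '</s>']
```

Because every step is a plain maximum, repeated runs return the same single sequence.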

sample(n: int, temperature: float)

The sample decoder samples n sequences in parallel from the model. The temperature parameter controls the randomness of the sampling process. Higher values of temperature lead to more random samples, while lower values lead to more likely samples. A temperature value of 0.0 is equivalent to the argmax decoder.
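The effect of temperature can be sketched with a small softmax sampler. The function names and the LOGITS values below are assumptions made for illustration, not part of LMQL; the sketch treats a temperature of 0.0 as the argmax limit case described above.

```python
import math
import random


def sample_token(logits, temperature, rng):
    """Sample one token from `logits` (token -> score) at a given temperature."""
    if temperature == 0.0:
        # Limit case: all probability mass sits on the highest-scoring token.
        return max(logits, key=logits.get)
    scaled = {tok: score / temperature for tok, score in logits.items()}
    top = max(scaled.values())  # subtract the max for numerical stability
    weights = {tok: math.exp(s - top) for tok, s in scaled.items()}
    tokens = list(weights)
    return rng.choices(tokens, weights=[weights[t] for t in tokens], k=1)[0]


def sample_n(logits, n, temperature, seed=0):
    """Draw n independent samples, loosely mirroring sample(n, temperature)."""
    rng = random.Random(seed)
    return [sample_token(logits, temperature, rng) for _ in range(n)]


LOGITS = {"cat": 2.0, "dog": 1.0, "fish": 0.0}  # hypothetical scores
print(sample_n(LOGITS, n=5, temperature=0.0))  # always the top token
print(sample_n(LOGITS, n=5, temperature=1.5))  # more varied draws
```

Dividing the scores by a larger temperature flattens the distribution, so less likely tokens are drawn more often.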

beam(n: int)

A simple beam search decoder. The n parameter controls the beam size. The beam search decoder is deterministic, so it will generate the same n sequences every time. The result of a beam query is a list of n sequences, sorted by their likelihood.
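A compact sketch of beam search over a toy next-token table may help here. TOY_MODEL and beam_search are hypothetical names, and scoring by summed log-probabilities without a length penalty is an assumption of this sketch rather than a statement about LMQL's decoder.

```python
import math

# Toy next-token table (hypothetical data) where greedy decoding is suboptimal.
TOY_MODEL = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.4, "dog": 0.35, "</s>": 0.25},
    "a": {"dog": 0.9, "cat": 0.1},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}


def beam_search(model, n, start="<s>", end="</s>", max_len=10):
    """Return up to n (tokens, log_prob) hypotheses, most likely first."""
    beams = [([start], 0.0)]
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == end:
                candidates.append((tokens, score))  # finished: carry over
                continue
            for tok, p in model[tokens[-1]].items():
                candidates.append((tokens + [tok], score + math.log(p)))
        # Keep only the n highest-scoring hypotheses.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:n]
        if all(tokens[-1] == end for tokens, _ in beams):
            break
    return beams


for tokens, score in beam_search(TOY_MODEL, n=2):
    print(tokens, round(math.exp(score), 2))
```

With this table, n=1 reproduces the greedy result ("the cat"), while n=2 also keeps the "a" prefix alive and ranks "a dog" first, since its overall probability (0.36) exceeds that of "the cat" (0.24).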

beam_sample(n: int, temperature: float)

A beam search decoder that samples from the beam at each step. The n parameter controls the beam size, while the temperature parameter controls the randomness of the sampling process. The result of a beam_sample query is a list of n sequences, sorted by their likelihood.

Novel Decoders

LMQL also implements a number of novel decoders. These decoders are experimental and may not work as expected. They are also not guaranteed to be stable across different LMQL versions. More documentation on these decoders will be provided in the future.

var(b: int, n: int)

An experimental implementation of variable-level beam search.

beam_var(n: int)

An experimental implementation of a beam search procedure that groups by currently-decoded variable and applies adjusted length penalties.

Inspecting Decoding Trees

LMQL also provides a way to inspect the decoding trees generated by the decoders. To do so, execute the query in the Playground IDE and click the Advanced Mode button in the top right corner of the Playground. This opens a new pane, where you can navigate and inspect the LMQL decoding tree:

A decoding tree as visualized in the LMQL Playground.

This view allows you to track the decoding process, active hypotheses and interpreter state, including the current evaluation result of the where clause. For an example, take a look at the translation example in the Playground (with Advanced Mode enabled).

Writing Custom Decoders

LMQL also includes dclib, a library for array-based decoding, which can be used to implement custom decoders. More information on this will be provided in the future. The implementation of the available decoding procedures is located in src/lmql/runtime/dclib/ of the LMQL repository.

Additional Decoding Parameters

Next to the decoding algorithm, LMQL also supports a number of additional decoding parameters, which can affect sampling behavior and token scoring:

max_len: int
The maximum length of the generated sequence. If not specified, the default value of max_len is 2048. Note that if the maximum length is reached before the query has come to a valid result according to the provided where clause, the LMQL runtime will throw an error.

top_k: int
Restricts the number of tokens to sample from in each step of the decoding process, based on Fan et al. (2018) (only applicable for sampling decoders).

top_p: float
Top-p (nucleus) sampling, based on Holtzman et al. (2019) (only applicable for sampling decoders).

repetition_penalty: float
Repetition penalty, based on Keskar et al. (2019). A value of 1.0 means no penalty. The more a token is already present in the generated sequence, the more its probability will be penalized.

frequency_penalty: float
The frequency_penalty parameter as documented as part of the OpenAI API.

presence_penalty: float
The presence_penalty parameter as documented as part of the OpenAI API.
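The top_k and top_p restrictions listed above can be sketched as filters over a next-token distribution. The helper names and the PROBS values are hypothetical, and renormalizing after filtering is an assumption of this sketch; actual backends may implement the filters on logits instead.

```python
def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept = dict(ranked[:k])
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}


def top_p_filter(probs, p):
    """Keep the smallest set of most likely tokens whose mass reaches p."""
    kept, mass = {}, 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = prob
        mass += prob
        if mass >= p:
            break
    return {tok: prob / mass for tok, prob in kept.items()}


PROBS = {"cat": 0.5, "dog": 0.3, "fish": 0.15, "newt": 0.05}  # hypothetical
print(top_k_filter(PROBS, k=2))    # only "cat" and "dog" remain
print(top_p_filter(PROBS, p=0.9))  # "cat", "dog" and "fish" remain
```

Sampling then proceeds from the filtered distribution, which is why both parameters only matter for sampling decoders.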


Note that the concrete implementation and availability of additional decoding parameters may vary across different inference backends. For reference, please see the API documentation of the respective inference interface, e.g. the HuggingFace generate() function or the OpenAI API.
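As one possible reading of the penalty parameters, the sketch below adjusts token scores based on what has already been generated. The function name is hypothetical. The repetition penalty here is applied once to every previously seen token (dividing positive scores, multiplying negative ones), while the frequency and presence penalties are subtracted as described in the OpenAI API documentation; given the note above, individual backends may well differ.

```python
from collections import Counter


def apply_penalties(logits, generated, repetition_penalty=1.0,
                    frequency_penalty=0.0, presence_penalty=0.0):
    """Return a penalized copy of `logits` (token -> score)."""
    counts = Counter(generated)
    out = {}
    for tok, score in logits.items():
        c = counts[tok]  # 0 for tokens that were never generated
        if c > 0:
            # Multiplicative penalty: always lowers the score of a seen token.
            if score > 0:
                score = score / repetition_penalty
            else:
                score = score * repetition_penalty
            # Additive penalties: one scales with the count, one is flat.
            score -= c * frequency_penalty + presence_penalty
        out[tok] = score
    return out


logits = {"cat": 2.0, "dog": -1.0, "fish": 0.5}  # hypothetical scores
history = ["cat", "cat", "dog"]
print(apply_penalties(logits, history, repetition_penalty=2.0))
print(apply_penalties(logits, history, frequency_penalty=0.5, presence_penalty=0.1))
```

Tokens that have not been generated yet ("fish" here) keep their original score in both calls.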

Runtime Parameters

Lastly, a number of additional runtime parameters are available, which can be used to control auxiliary aspects of the decoding process:

chunksize: int
The chunksize parameter used for max_tokens in OpenAI API requests or in speculative inference with local models. If not specified, the default value of chunksize is 32. See also the description of this parameter in the Models chapter.

verbose: bool
Enables verbose console logging for individual LLM inference calls (local generation calls or OpenAI API request payloads).

cache: Union[bool,str]
True or False to enable or disable in-memory token caching. If not specified, the default value of cache is True, indicating that in-memory caching is enabled. Setting cache to a string value specifies a local file to use for disk-based caching, enabling caching across multiple query executions and sessions.

openai_nonstop
Experimental option for OpenAI-specific non-stop generation, which can further improve the effectiveness of caching in some scenarios.

chunk_timeout
OpenAI-specific maximum time in seconds to wait for the next chunk of tokens to arrive. If exceeded, the current API request will be retried with an appropriate backoff. If not specified, the default value of chunk_timeout is 2.5. Adjust this parameter if you are seeing a high number of timeouts in the console output of the LMQL runtime.
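The retry behavior described for chunk_timeout can be pictured with a generic retry loop. Everything in this sketch is an assumption made for illustration: the function names, the max_retries and base_delay parameters, and the choice of exponential backoff are not taken from the LMQL runtime.

```python
import time


def fetch_with_retry(fetch_chunk, chunk_timeout=2.5, max_retries=3,
                     base_delay=0.5, sleep=time.sleep):
    """Call fetch_chunk(timeout); on TimeoutError retry with growing delays."""
    for attempt in range(max_retries + 1):
        try:
            return fetch_chunk(chunk_timeout)
        except TimeoutError:
            if attempt == max_retries:
                raise  # give up after the last allowed retry
            sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...


def make_flaky_fetch(failures):
    """Hypothetical stand-in for an API call that times out a few times."""
    state = {"left": failures}

    def fetch(timeout):
        if state["left"] > 0:
            state["left"] -= 1
            raise TimeoutError(f"no chunk within {timeout}s")
        return ["Hello", " world"]

    return fetch


delays = []
chunk = fetch_with_retry(make_flaky_fetch(2), sleep=delays.append)
print(chunk, delays)  # ['Hello', ' world'] [0.5, 1.0]
```

A larger chunk_timeout means fewer retries are triggered when chunks arrive slowly, which is the trade-off to consider when timeouts show up frequently in the console output.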