LMQL supports various decoding algorithms, which are used to generate text from the token distribution of a language model. The decoding algorithm to use can be specified right at the beginning of a query, e.g. using a decoder keyword such as argmax or beam.
All supported decoding algorithms are model-agnostic and can be used with any LMQL-supported inference backend. For more information on the supported inference backends, see the Models chapter.
Setting The Decoding Algorithm
Depending on context, LMQL offers two ways to specify the decoding algorithm to use.
Decoder Configuration as part of the query: The first option is to specify the decoding algorithm and its parameters as part of the query itself. This is particularly useful if your choice of decoder is relevant to the concrete program you are writing.
    # use beam search with beam width 2 for
    # the entire program
    beam(n=2)

    # uses beam search to generate RESPONSE
    "This is a query with a specified decoder: [RESPONSE]"
Decoding algorithms are always specified for the entire query program, and cannot change within a program. To use different decoders for different parts of your program, you have to split your program into multiple queries.
Specifying the Decoding Algorithm Externally: The second option is to specify the decoding algorithm and parameters externally, i.e. separately from the actual program code:
    import lmql

    # The decorator arguments below are illustrative; they set the default decoder.
    @lmql.query(decoder="sample", temperature=0.8)
    def tell_a_joke():
        '''lmql
        """A list of good dad jokes. A indicates the punchline:
        Q:[JOKE]
        A:[PUNCHLINE]""" where STOPS_AT(JOKE, "?") and STOPS_AT(PUNCHLINE, "\n")
        '''

    tell_a_joke() # uses the decoder specified in @lmql.query(...)
    tell_a_joke(decoder="beam", n=2) # uses a beam search decoder with n=2
This is only possible when using LMQL from a Python context.
In general, the very first keyword of an LMQL query specifies the decoding algorithm to use. The following decoder keywords are available:
argmax
The argmax decoder is the simplest decoder available in LMQL. It greedily selects the most likely token at each step of the decoding process. It has no additional parameters. Since argmax decoding is deterministic, one can only generate a single sequence at a time.
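To make the greedy selection concrete, here is a minimal sketch in plain Python. It does not use LMQL; the `TOY_MODEL` table and the `argmax_decode` helper are made up for illustration and stand in for a real language model and decoder.

```python
# Toy stand-in for a language model (an assumption for illustration):
# maps the last token to a distribution over the next token.
TOY_MODEL = {
    "<s>": {"the": 0.55, "a": 0.45},
    "the": {"cat": 0.4, "dog": 0.35, "</s>": 0.25},
    "a": {"dog": 0.9, "cat": 0.1},
    "cat": {"</s>": 0.9, "the": 0.1},
    "dog": {"</s>": 0.8, "a": 0.2},
}


def argmax_decode(model, start="<s>", eos="</s>", max_len=10):
    """Greedily pick the most likely next token until EOS or max_len."""
    tokens = [start]
    for _ in range(max_len):
        dist = model[tokens[-1]]
        next_token = max(dist, key=dist.get)  # most likely token
        if next_token == eos:
            break
        tokens.append(next_token)
    return tokens[1:]  # drop the start symbol


greedy_tokens = argmax_decode(TOY_MODEL)
print(greedy_tokens)
```

Because every step takes the single most likely token, repeated calls on the same table always produce the same sequence.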
sample(n: int, temperature: float)
The sample decoder samples n sequences in parallel from the model. The temperature parameter controls the randomness of the sampling process. Higher values of temperature lead to more random samples, while lower values lead to more likely samples. A temperature value of 0.0 is equivalent to the argmax decoder.
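The effect of the temperature can be sketched in a few lines of plain Python. This is a generic illustration of temperature-scaled sampling, not LMQL's implementation; the `apply_temperature` and `sample_token` helpers and the example logits are assumptions.

```python
import math
import random


def apply_temperature(logits, temperature):
    """Turn logits into probabilities after dividing by the temperature."""
    if temperature <= 0.0:
        # Limit case: all mass on the highest-scoring token,
        # which is the token greedy decoding would pick.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature, rng):
    """Sample one token index from the temperature-scaled distribution."""
    probs = apply_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.1]             # made-up scores for three tokens
cold = apply_temperature(logits, 0.5)  # sharper distribution
hot = apply_temperature(logits, 2.0)   # flatter distribution
rng = random.Random(0)                 # seeded for reproducibility
draws = [sample_token(logits, 1.0, rng) for _ in range(5)]
print(cold, hot, draws)
```

A low temperature concentrates probability on the top token, while a high temperature spreads it more evenly across the alternatives.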
beam(n: int)
A simple beam search decoder. The n parameter controls the beam size. The beam search decoder is deterministic, so it will generate the same n sequences every time. The result of a beam query is a list of n sequences, sorted by their likelihood.
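The following sketch shows the general idea of beam search on a toy next-token table. It is a simplified, generic version written for illustration and is not LMQL's implementation; `TOY_MODEL` and `beam_search` are hypothetical names.

```python
import math

# Toy stand-in for a language model (an assumption for illustration):
# maps the last token to a distribution over the next token.
TOY_MODEL = {
    "<s>": {"the": 0.55, "a": 0.45},
    "the": {"cat": 0.4, "dog": 0.35, "</s>": 0.25},
    "a": {"dog": 0.9, "cat": 0.1},
    "cat": {"</s>": 0.9, "the": 0.1},
    "dog": {"</s>": 0.8, "a": 0.2},
}


def beam_search(model, n, start="<s>", eos="</s>", max_len=10):
    """Keep the n highest-scoring partial sequences at every step."""
    beams = [([start], 0.0)]  # (tokens, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = [
            (tokens + [tok], score + math.log(prob))
            for tokens, score in beams
            for tok, prob in model[tokens[-1]].items()
        ]
        # Best score first; ties are broken by the token list.
        candidates.sort(key=lambda c: (-c[1], c[0]))
        beams = []
        for tokens, score in candidates[:n]:
            if tokens[-1] == eos:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    ranked = sorted(finished + beams, key=lambda c: (-c[1], c[0]))[:n]
    # Strip the start and end symbols from the returned sequences.
    return [([t for t in toks[1:] if t != eos], score) for toks, score in ranked]


results = beam_search(TOY_MODEL, n=2)
for tokens, score in results:
    print(tokens, round(math.exp(score), 3))
```

In this toy table, the most likely first token ("the") does not lead to the most likely complete sequence, so a beam of size 2 ranks "a dog" above "the cat". With `n=1` the procedure reduces to greedy decoding.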
beam_sample(n: int, temperature: float)
A beam search decoder that samples from the beam at each step. The n parameter controls the beam size, while the temperature parameter controls the randomness of the sampling process. The result of a beam_sample query is a list of n sequences, sorted by their likelihood.
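One possible reading of "sampling from the beam" is to draw the next set of hypotheses at random, weighted by their temperature-scaled scores, instead of always keeping the top n. The sketch below illustrates that reading on a toy table. It is an assumption about the general technique, not a description of how LMQL implements beam_sample, and all names in it are hypothetical.

```python
import math
import random

# Toy stand-in for a language model (an assumption for illustration).
TOY_MODEL = {
    "<s>": {"the": 0.55, "a": 0.45},
    "the": {"cat": 0.4, "dog": 0.35, "</s>": 0.25},
    "a": {"dog": 0.9, "cat": 0.1},
    "cat": {"</s>": 0.9, "the": 0.1},
    "dog": {"</s>": 0.8, "a": 0.2},
}


def sample_without_replacement(items, weights, k, rng):
    """Draw up to k distinct items, each draw proportional to its weight."""
    items, weights = list(items), list(weights)
    chosen = []
    while items and len(chosen) < k:
        idx = rng.choices(range(len(items)), weights=weights, k=1)[0]
        chosen.append(items.pop(idx))
        weights.pop(idx)
    return chosen


def beam_sample(model, n, temperature, rng, start="<s>", eos="</s>", max_len=10):
    """Beam search variant that samples n hypotheses per step (temperature > 0)."""
    beams, finished = [([start], 0.0)], []
    for _ in range(max_len):
        candidates = [
            (tokens + [tok], score + math.log(prob))
            for tokens, score in beams
            for tok, prob in model[tokens[-1]].items()
        ]
        # Weight each candidate by exp(score / T); a higher T flattens the weights.
        weights = [math.exp(score / temperature) for _, score in candidates]
        picked = sample_without_replacement(candidates, weights, n, rng)
        finished += [c for c in picked if c[0][-1] == eos]
        beams = [c for c in picked if c[0][-1] != eos]
        if not beams:
            break
    ranked = sorted(finished + beams, key=lambda c: -c[1])[:n]
    return [([t for t in toks[1:] if t != eos], score) for toks, score in ranked]


sampled = beam_sample(TOY_MODEL, n=2, temperature=1.0, rng=random.Random(0))
for tokens, score in sampled:
    print(tokens, round(math.exp(score), 3))
```

Unlike the deterministic variant, different seeds can yield different hypotheses, while the returned list is still ordered by likelihood.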
LMQL also implements a number of novel decoders. These decoders are experimental and may not work as expected. They are also not guaranteed to be stable across different LMQL versions. More documentation on these decoders will be provided in the future.
var(b: int, n: int)
An experimental implementation of variable-level beam search.
An experimental implementation of a beam search procedure that groups by currently-decoded variable and applies adjusted length penalties.
Inspecting Decoding Trees
LMQL also provides a way to inspect the decoding trees generated by the decoders. For this, execute the query in the Playground IDE and click on the Advanced Mode button in the top right corner of the Playground. This will open a new pane where you can navigate and inspect the LMQL decoding tree.
This view allows you to track the decoding process, active hypotheses and interpreter state, including the current evaluation result of the where clause. For an example, take a look at the translation example in the Playground (with Advanced Mode enabled).
Writing Custom Decoders
LMQL also includes dclib, a library for array-based decoding, which can be used to implement custom decoders. More information on this will be provided in the future. The implementation of the available decoding procedures is located in src/lmql/runtime/dclib/decoders.py of the LMQL repository.
Additional Decoding Parameters
Next to the decoding algorithm, LMQL also supports a number of additional decoding parameters, which can affect sampling behavior and token scoring:
|The maximum length of the generated sequence. If not specified, the default value of |
|Restricts the number of tokens to sample from in each step of the decoding process, based on Fan et al. (2018) (only applicable for sampling decoders).|
|Top-p (nucleus) sampling, based on Holtzman et al. (2019) (only applicable for sampling decoders).|
|Repetition penalty, |
Note that the concrete implementation and availability of additional decoding parameters may vary across different inference backends. For reference, please see the API documentation of the respective inference interface, e.g. the HuggingFace generate() function or the OpenAI API.
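As a rough illustration of how top-k and top-p (nucleus) filtering restrict the set of tokens to sample from, consider the following plain-Python sketch. The helper names and the example distribution are assumptions for illustration; actual backends may implement these filters differently.

```python
def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize their probabilities."""
    keep = sorted(probs, key=probs.get, reverse=True)[:k]
    total = sum(probs[t] for t in keep)
    return {t: probs[t] / total for t in keep}


def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative mass reaches p."""
    kept, mass = {}, 0.0
    for token in sorted(probs, key=probs.get, reverse=True):
        kept[token] = probs[token]
        mass += probs[token]
        if mass >= p:
            break
    return {t: q / mass for t, q in kept.items()}


# Made-up next-token distribution.
probs = {"cat": 0.5, "dog": 0.3, "fish": 0.15, "emu": 0.05}

top2 = top_k_filter(probs, k=2)       # only "cat" and "dog" remain
nucleus = top_p_filter(probs, p=0.9)  # "cat", "dog" and "fish" remain
print(top2)
print(nucleus)
```

Both filters only remove low-probability tokens before sampling, which is why they have no effect on deterministic decoders.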
Lastly, a number of additional runtime parameters are available, which can be used to control auxiliary aspects of the decoding process:
|The chunksize parameter used for |
|Enables verbose console logging for individual LLM inference calls (local generation calls or OpenAI API request payloads).|
|Experimental option for OpenAI-specific non-stop generation, which can further improve the effectiveness of caching in some scenarios.|
|OpenAI-specific maximum time in seconds to wait for the next chunk of tokens to arrive. If exceeded, the current API request will be retried with an appropriate backoff. |
If not specified, the default value of