LMQL Developer Survey

February 14, 2024

We have started a new initiative called the LMQL developer survey. With this short survey, we aim to learn more from everyone in the LMQL and broader LLM communities. We are looking for broader feedback signals on how people are using LMQL, what they are using it for, and what they would like to use it for.

The outcome of this survey will help shape our work around the next major version of LMQL.

You can find the survey here: https://forms.gle/pGvAicNpUhS1rAkK9.

LMQL 0.7 brings Procedural Prompt Programming

October 10, 2023

Today, we are releasing LMQL 0.7. This release series is the biggest update since the original release and includes many community contributions. Alongside several new main-line features such as nested queries, the Generations API and the Chat API, it also includes several experimental preview features, allowing you to experiment with upcoming functionality before it is fully released.

LMQL 0.7 has also moved to semantic versioning with the direct predecessor being 0.0.6.6. This means that the next feature release will be 0.8, and the next bugfix release will be 0.7.1.

Nested Queries for Procedural Prompt Programming

In 0.7, you can now use Nested Queries to call an LMQL query as a nested function in the context of another query. With this, LMQL brings procedural programming to prompting. To illustrate, consider the following example:

lmql
# chain of thought prompting strategy
@lmql.query
def chain_of_thought():
    '''lmql
    "A: Let's think step by step.\n [REASONING]"
    "Therefore the answer is[ANSWER]" where STOPS_AT(ANSWER, ".")
    return ANSWER.strip()
    '''

# top-level query
"Q: It is August 12th, 2020. What date was it \
    100 days ago? [ANSWER: chain_of_thought]"

ANSWER # May 4th, 2020

We first define a simple LMQL function chain_of_thought to do chain-of-thought prompting. In our top-level query, we can then call this function to decode an answer using the [ANSWER: chain_of_thought] syntax. During execution, LMQL then inserts the instructions and constraints from chain_of_thought into the top-level query, generates a value for ANSWER, and then removes the instructions and constraints again, only returning the final result.

Nested queries are Prompt Function Calls. This design of nested queries is inspired by the idea of function or procedure calls in traditional programming. Removing intermediate instructions and constraints also has parallels to the idea of stack unwinding, a technique to implement function calls in low-level languages.

LMQL transfers these ideas to prompting, inheriting the general benefits of procedural programming:

  • Encapsulation and Model Focus Nested Queries encapsulate and hide the prompting logic used to generate ANSWER, which means our top-level query is much cleaner and more concise. Further, by hiding intermediate instructions from the model in the context of the top-level query, we can reduce noise in the overall prompt, allowing the model to focus on the currently relevant information only, and not get distracted by previous intermediate steps.

  • Nesting and Reuse LMQL queries can be nested arbitrarily deep, allowing you to reuse and combine queries modularly. For example, you could define a query get_year that extracts a year from a piece of response text, and then reuse it from chain_of_thought or other queries (see the sketch below). By making sub-prompts modular, nested queries also allow you to reuse prompts across different query programs.
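
As a rough sketch of such reuse, consider the following hypothetical get_year query; its prompt, constraint and the surrounding top-level query are illustrative assumptions, not part of the release:

lmql
# hypothetical reusable sub-query (illustrative sketch)
@lmql.query
def get_year():
    '''lmql
    "Answer with the year only: [YEAR]" where STOPS_AT(YEAR, "\n")
    return YEAR.strip()
    '''

# top-level query combining both nested queries
"Q: In which year did the first moon landing take place? [ANSWER: chain_of_thought]"
"For the record, the year was [YEAR: get_year]"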

To learn more about nested queries, please refer to the relevant chapter in the documentation.

Generations API

LMQL 0.7 adds the Generations API, a lightweight high-level library for LMQL-based text generation and scoring. The API was designed to be easy to use and does not require users to write any LMQL themselves:

python
# obtain a model instance
m: lmql.LLM = lmql.model("openai/gpt-3.5-turbo-instruct")
# simple generation
m.generate_sync("Hello", max_tokens=10)
# -> Hello, I am a 23 year old female.

Functions such as LLM.generate and LLM.score allow you to generate and score text using any LMQL-supported inference backend. The Generations API is also seamlessly compatible with standard LMQL, allowing you to switch between and combine the two as needed.
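
For example, scoring candidate continuations against a prompt only takes one call (a minimal sketch; the printed result shape is an assumption):

python
import lmql

# score candidate continuations against a prompt (sketch)
m: lmql.LLM = lmql.model("openai/gpt-3.5-turbo-instruct")
result = m.score_sync("Hello", ["World", "Apples"])
print(result)  # one score per candidate continuation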

For more information, please refer to the documentation.

Chat

LMQL 0.7 adds a new Chat API, allowing you to easily deploy chatbots with just a couple of lines of LMQL.

LMQL Chat comes with custom output writers that let you easily stream chatbot input and output over a variety of channels, including WebSockets, HTTP, and SSE. A simple lmql chat CLI tool was also added, allowing you to instantly launch your LMQL queries as fully interactive chatbots.

We also provide documentation resources on how to get started with chatbot development with LMQL, including chapters on Chatbot Serving, Internal Reasoning and Defending against Prompt Injection. For more information, please refer to the documentation.
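
As a rough sketch, a chatbot query boils down to an interactive loop over user messages; the tags and the input() helper below follow the chat-loop pattern from the documentation, but the exact prompt is an illustrative assumption:

lmql
# minimal chat loop (sketch): alternate user input and assistant responses
argmax
    "{:system} You are a helpful assistant."
    while True:
        "{:user} {await input()}"
        "{:assistant} [ANSWER]"
from
    "chatgpt"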

Backends

LMQL 0.7 ships with three new backends for inference and tokenization:

  • LMQL 0.7 adds support for OpenAI's newly released gpt-3.5-turbo-instruct model. In contrast to other 3.5 series models, this variant supports the Completions API, which means that LMQL constraints are compatible with it.

  • LMQL now supports hosting models on replicate.com infrastructure, allowing you to run LMQL models in the cloud. To learn more, please refer to the documentation. Thanks a lot to community member @charles-dyfis-net for contributing this!

  • LMQL added sentencepiece as an additional tokenization backend, specifically for llama.cpp models. This means llama.cpp models can now be used without requiring transformers for tokenization. Thanks a lot to community member @khushChopra for contributing this.

Inference Certificates

To make LLM inference more transparent and reproducible, LMQL 0.7 also adds inference certificates. An inference certificate is a simple data structure that records essential information needed to reproduce an inference result. Certificates can be generated for any LLM call that happens in an LMQL context.

To produce an inference certificate, pass certificate=True or certificate=<filename> to your query or generate call:

python
# (definition of the say_hello query omitted)
# call and save certificate
say_hello(certificate="my-certificate.json")

The resulting certificate file provides a way to document, trace and reproduce LLM inference results by recording the exact (tokenized) prompts and information on the environment and generation parameters.

This can be helpful to better understand what is happening during inference, to debug issues, and to reproduce results. It also offers a way to document LLM failures, to better guide the discussion around the concrete capabilities and limitations of LLMs.
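
Certificates also work with the Generations API; the following minimal sketch passes a certificate file name to a generate call (the file name is illustrative):

python
import lmql

# request a certificate alongside a plain generate call (sketch)
m = lmql.model("openai/gpt-3.5-turbo-instruct")
m.generate_sync("Hello", max_tokens=10, certificate="hello-certificate.json")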

Decorators

Variable Decorators offer a new and simple way to call custom Python functions as part of the core generation loop in LMQL:

lmql
def screaming(value):
    """Decorator to convert a string to uppercase"""
    return value.upper()

"Say 'this is a test':[@screaming TEST]"
Example output, with the value generated for TEST:

Say 'this is a test': THIS IS A TEST

Similar to Python decorators, LMQL decorators are functions that take a variable as input and can wrap and modify its value.

In the example above, we use the @screaming decorator to convert the value of TEST to uppercase. Decorators can be used to implement a wide range of custom functionality, including string normalization, datatype conversion, and more. LMQL also provides decorators that allow you to stream or pre-process data during generation. For more information, please refer to the documentation.
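
For instance, a string-normalization decorator follows the same pattern as @screaming (a small illustrative sketch, not taken from the release notes):

lmql
# sketch: a decorator that strips surrounding whitespace from the final value
def trimmed(value):
    """Decorator to strip leading and trailing whitespace"""
    return value.strip()

"Name one primary color:[@trimmed COLOR]" where STOPS_AT(COLOR, "\n")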

Documentation Update

The website and many chapters of the LMQL documentation have also been updated and extended, and now include more examples and explanations. We have also refreshed the visual design to make the documentation easier to read and navigate.

The documentation now also includes a work-in-progress Language Reference, which aims to provide a more comprehensive and formal description of LMQL's syntax and semantics, all in one place.

Preview Features

Apart from many new core features, LMQL 0.7 also ships with several experimental preview features, allowing you to test drive new functionality before it has fully stabilized and is released as main-line functionality.

These features are marked as experimental and are not yet fully supported. We are releasing them to gather feedback and to allow users to test them out early on. Note that these features are subject to change and may be removed/modified in future releases.

LMQL Actions Preview

LMQL Actions is the first version of LMQL's function calling layer. It allows you to expose arbitrary Python functions to the LLM reasoning loop and lets the model call them during generation. Function demonstration and the calling protocol can both be handled automatically by the LMQL runtime, allowing for simple use like this:

def wiki(q): ...
def calc(expr): ...

"Q: What is the population of the US and Germany combined?"
"A: [REASONING]" where inline_use(REASONING, [wiki, calc])

A future release will bring more documentation and details on Actions, including how to use and customize it for your use cases. Until then, we invite everyone to try it out and hack on the current implementation, fully contained in actions.py.

Regex Constraints Preview

LMQL now has support for regex constraints, allowing you to use regular expressions to constrain the output of a variable. For example, the following query will always generate a valid date of the form DD/MM:

"It's the last day of June so today is [RESPONSE]" where REGEX(RESPONSE, r"[0-9]{2}/[0-9]{2}")

Types / Datatype Constraints Preview

LMQL is moving towards fully typed LLM generation. On the way there, we have started to add support for dataclass constraints, allowing you to constrain the output of a variable to a specific structured output schema:

lmql
import lmql
from dataclasses import dataclass

@dataclass
class Person:
    name: str
    age: int
    job: str

"Alice is a 21 years old and works as an engineer at LMQL Inc in Zurich, Switzerland.\n"
"Structured: [PERSON_DATA]\n" where type(PERSON_DATA) is Person

PERSON_DATA
# Person(name='Alice', age=21, job='engineer')

To achieve this, LMQL leverages constrained generation to make sure the LLM always produces all information required to populate a valid Person object. The resulting PERSON_DATA object can then be directly used like a regular Python object. Types are still in an early stage and we are working on adding more features and functionality.

Other Changes

🎬 And that's a wrap!

LMQL 0.7 is a big release and we are excited to see what you will build with it. As always, please let us know if you have any questions, suggestions or bug reports, on GitHub, Discord, Twitter or via hello@lmql.ai.

LMQL v0.0.6.6

July 25, 2023

We just released LMQL 0.0.6.6. This is a minor update with a couple of smaller fixes and improvements.

  • lmql.F now supports positional arguments:
python
greet = lmql.F("Greet {a} and {b}: [GREETING]")

# call with positional arguments
greet("Alice", "Bob") # Greet Alice and Bob: Hello!
# call with keyword arguments
greet(a="Alice", b="Bob") # Greet Alice and Bob: Hello!
  • We improved the error handling of the llama.cpp backend and fixed a bug with model identifier parsing.

  • We also fixed a bug with the LMTP scheduler, where CPU load was high even when no tasks were present. Thanks to community member @4onen for reporting and fixing this!

  • Added backend support for auto_gptq quantized models, contributed by community member @meditans.

  • We fixed an issue where for Azure OpenAI models, a dummy configuration api.env was needed. See our documentation for details. Thanks to community members Missing and @hooman-bayer for their feedback and contributions to this.

Versioning Note: 0.0.6.6 is the last release with two leading zeros. Starting with the next release, LMQL will adopt semantic versioning and use a single leading zero, i.e. 0.6.7.

LMQL becomes simpler and adds llama.cpp

July 13, 2023

Today we are releasing LMQL 0.0.6.5. This update contains a major simplification of the LMQL syntax, moving it much closer to standard Python. It also includes a llama.cpp based inference backend, several bug fixes and other minor improvements.

You can try the latest version of LMQL in your browser at lmql.ai/playground or install it via pip install lmql.

One Line Is All It Takes

Most notably, 0.0.6.5 comes with several simplifications of the core syntax of LMQL. Of course, all changes are backwards compatible, so you can continue to use your existing query code and move to the new version without any changes.

With this, we aim to minimize syntactic overhead, employing sensible defaults to enable more concise programs like the following:

"One line is all it takes [CONTINUATION]"
Example output, with the value generated for CONTINUATION:

One line is all it takes Fallin' in love with me.

Sensible Defaults This is possible because LMQL now automatically assumes argmax decoding and openai/text-davinci-003 as (configurable) defaults. If you prefer to use a different model or custom decoder settings, you can still specify them explicitly, e.g. in the @lmql.query decorator function as demonstrated later in this post.

Without any additional configuration, the simple query code above translates to a full LMQL program like this:

argmax "One line is all it takes [CONTINUATION]" from "openai/text-davinci-003"

Inline Constraints

LMQL now allows you to specify several inline where constraints. This enables constraints that refer to local program variables, which means constraints can now be dependent on previous model outputs.

"A list of awesome Dua Lipa songs:\n"
songs = []

"- New Rules\n"
for i in range(4):
    "-[SONG]\n" where STOPS_BEFORE(SONG, "\n")
    songs.append(SONG)

"Out of these, my favorite is[FAVORITE]" where FAVORITE in songs
Example output, with the generated SONG and FAVORITE values:

A list of awesome Dua Lipa songs:
- New Rules
- Don't Start Now
- IDGAF
- Be the One
- Blow Your Mind (Mwah)
Out of these, my favorite is Don't Start Now

Note also how in this example LMQL code now reads much more like standard Python code, without any additional level of indentation.


@lmql.query functions

The overhauled syntax also makes LMQL much easier on the eyes when used with the @lmql.query function decorator in Python:

python
import lmql
import json

@lmql.query(model="openai/text-curie-001", temperature=0.9)
def summarize(): 
    '''lmql
    """
    Provide a summary of Dua Lipa, the pop icon:
    {{
      "name": "[STRING_VALUE]",
      "chart_position": [INT_VALUE],
      "top_songs": [[
         "[STRING_VALUE]",
         "[STRING_VALUE]"
      ]]
    }}
    """ where STOPS_BEFORE(STRING_VALUE, '"') and INT(INT_VALUE) and len(TOKENS(INT_VALUE)) < 3
    
    return json.loads(context.prompt.split("pop icon:",1)[1])
    '''

print(summarize()) # {'name': 'Dua Lipa', 'chart_position': 3415, 'top_songs': ['New Rules', 'Havana']}


lmql.F Lambda Functions

Based on LMQL's new minimal syntax, we introduce a novel and concise way to write LLM-based lambda functions. This offers a lightweight entry point for integrating small LLM-based utilities into your code, without having to write a full LMQL program.

python
import lmql

summarize = lmql.F("Summarize the following in a few words: {data}: [SUMMARY]")
main_subject = lmql.F("What is the main subject (noun) of the following text? {data}: [SUBJECT]", 
                      "len(TOKENS(SUBJECT)) < 20")

text = "In LMQL, users can specify high-level, logical constraints ..."

summarize(data=text) # LMQL enables high-level constraints to be enforced during text 
                     # generation, simplifying multi-part prompting and integration.
main_subject(data=text) # Language Model Query Language (LMQL)



llama.cpp Inference Backend

LMQL now also fully integrates with llama.cpp, the excellent C++ implementation of a number of Transformer-based language models.

Using llama.cpp from LMQL is as simple as specifying it in the from clause of a query:

argmax "Say 'this is a test':[RESPONSE]" from "llama.cpp:<PATH TO WEIGHTS>.bin"

We support both in-process loading of llama.cpp and remote inference via lmql serve-model. To learn more about llama.cpp and how to use it with LMQL, check out the corresponding chapter in the LMQL documentation.


Other Changes

  • LMQL now includes a random model backend, which randomly samples tokens from the GPT-2 vocabulary. This is useful for debugging and testing purposes and can be used for data generation in the context of highly constrained query programs.

  • Two caching issues have been fixed, avoiding cache collisions which could lead to repeated model outputs.

  • More robust query string parsing, allowing for reliable escaping of the special characters [, ], { and }.

  • Added support for transformers based Llama models and the associated (fast) implementation of HF tokenizers.

  • Simplified Azure OpenAI support, see the relevant chapter in the documentation.

We thank community members @minosvasilias and @CircArgs for their contribution to this release.

Releasing LMQL 0.0.6.4: LMTP, Azure, Synchronous API, and more

June 8, 2023

Among many things, this update contains several bug fixes and improvements. The most notable changes are:

  • Azure OpenAI support LMQL now supports OpenAI models that are served via Azure. For more information on how to use Azure models, please see the corresponding chapter in the documentation. Many thanks to @veqtor for contributing this feature.

  • Local Models via the Language Model Transport Protocol LMQL 0.0.6.4 implements a novel protocol to stream token output from local models, vastly improving performance. In our first benchmarks, we observed a 5-6x speedup for local model inference. For more information on how to use local models, please see the corresponding chapter in the documentation.

    To learn more about the internals of the new streaming protocol, i.e. the language model transport protocol (LMTP), you can find more details in this README file. In the future, we intend to implement more model backends using LMTP, streamlining communication between LMQL and models.


    LMQL's new streaming protocol (LMTP) allows for faster local model inference.
  • Synchronous Python API In addition to the async/await-based API, LMQL now also provides a synchronous API. This means you no longer need to use asyncio to use LMQL from Python.

    To use the synchronous API, simply declare your @lmql.query function without the async keyword, e.g.

    python
    import lmql
    
    @lmql.query
    def hello(s: str):
        '''lmql
        argmax 
            "Hello {s} [RESPONSE]" 
            return RESPONSE
        from 
            "chatgpt"
        '''
    
    print(hello("world")) # ['Hello! How can I assist you today?']
    

    If you instead want to use lmql.run in a synchronous context, you can now use lmql.run_sync instead (see the sketch after this list). To learn more about how LMQL can be used from Python, check out our documentation.

  • Improved Tokenizer Backends LMQL can now use the excellent tiktoken tokenizer as tokenization backend (for OpenAI models). Furthermore, all tokenization backends have been ported to operate on a byte-level, which improves support for multibyte characters and emojis. This is especially relevant for non-English languages and special characters.

  • Docker Image LMQL now provides a Docker image that can be used to run the LMQL playground in a containerized environment. For more information, please see the documentation. Many thanks to @SilacciA for contributing this feature.

  • Faster Startup Time We optimized LMQL's import hierarchy, which results in faster module loading time.
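
Relating to the synchronous API above, here is a minimal sketch of lmql.run_sync with an inline query string (the query itself is illustrative):

python
import lmql

# run a query string synchronously, without async/await (sketch)
results = lmql.run_sync('argmax "Hello[WHO]" from "chatgpt"')
print(results)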

LMQL v0.0.6.3

May 11, 2023

Today, we are releasing LMQL v0.0.6.3. This update contains several bug fixes and improvements. The most notable changes are:

  • Lighter Runtime As part of our continued efforts, we made LMQL much lighter (no more mandatory transformers dependency). By default LMQL now no longer requires transformers or PyTorch. If you rely on local models, just install LMQL via pip install lmql[hf] to get full Transformers integration.

  • Token Constraints A new function TOKENS(...) was added to the LMQL constraint language, allowing you to specify lower and upper bounds or the exact number of tokens to generate for a given variable.

    argmax 
        "A 10 token response[WHO]" 
    from 
        "openai/text-ada-001" 
    where 
        len(TOKENS(WHO)) == 10
    
  • Conditional Stopping STOPS_AT can now be combined with additional side conditions. This allows you to specify stopping phrases that are only enforced once other conditions are met.

    For example, below, we stop when the generated text hits a newline character, but only if the overall variable output is already at least 10 tokens long.

    argmax 
        "Hello[WHO]" 
    from 
        "openai/text-ada-001" 
    where 
        len(TOKENS(WHO)) > 10 and STOPS_AT(WHO, "\n")
    
  • lmql.run: Improved input validation for lmql.run, as contributed by @lfegray. More specifically, lmql.run will now provide more helpful error messages when client logic does not specify input values for all required query parameters.

  • Automatic Cache Invalidation: LMQL's tokenizer cache at ~/.cache/lmql is now invalidated automatically when upgrading to a new version. This should prevent issues with outdated cache files.

Note: Version 0.0.6.2 was skipped and yanked from pypi.org, as an invalid release was pushed accidentally.

LMQL v0.0.6.1

May 3, 2023

We released LMQL v0.0.6.1, which contains several bug fixes and improvements. The most notable changes are:

  • Cache Layer Bug Fixes This release contains several fixes and improvements to the recently introduced cache layer.

  • Stopping Phrases Stopping phrases specified via STOPS_BEFORE are now passed to the OpenAI API as the "stop" parameter, decreasing the number of tokens used for the request. If you want to disable this (e.g. to allow speculative execution), you can specify the new decoder parameter openai_nonstop=True (see the sketch after this list).

  • Asynchronous Output Writers All output writers have been refactored to use asynchronous I/O. This should simplify integration with other asynchronous frameworks, e.g. for HTTP or Websocket APIs. We also added a new chapter on Output Streaming to the documentation.

  • Output Writers for HTTP endpoints, WebSockets and Server-Sent Events Based on the updated output writer interface, we added three new output writers for serving LMQL queries as HTTP endpoints, WebSockets and via Server-Sent Events (SSE). To learn more, check their relatively simple implementations in the new lmql.output module. We will also provide more documentation on how to use them, e.g. with aiohttp in the future.
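
Relating to the stopping phrases item above, a sketch of how the openai_nonstop decoder parameter can be passed (the query content is illustrative):

lmql
# sketch: keep speculative execution by not forwarding the stopping phrase as the OpenAI "stop" parameter
argmax(openai_nonstop=True)
    "Write a one-line greeting:[GREETING]"
from
    "openai/text-davinci-003"
where
    STOPS_BEFORE(GREETING, "\n")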

Releasing the LMQL Caching Layer (v0.0.6)

May 1, 2023

Today we are releasing LMQL 0.0.6, the first version of LMQL that integrates the LMQL Caching Layer. The caching layer can drastically reduce the token use of LLM interactions, lowering both the cost and latency of running queries. In this blog post, we provide a quick overview of the caching layer and demonstrate how it can reduce the token use, latency and number of requests needed to run queries by up to 80%. We observe improvements across a wide range of scenarios, including template-based queries, long-form constraints and tool augmentation.

You can experiment with LMQL in the browser-based Playground IDE or install the latest version locally, via pip install lmql.

Caching Layer

The caching layer is implemented as a tree-based data structure that caches all model output, including logits, tokens, and metadata, allowing the runtime to explore the token space of an LLM more efficiently, even in the presence of multiple variables, constraints and tool augmentation. The cache can be considered an append-only tree that is explored during query execution, expanding branches according to query code, constraints and speculative execution.

To illustrate the effect of a caching layer, we consider the following example scenarios, all of which now run in a fraction of the time and with a fraction of the tokens needed with traditional querying methods.

Template-Based Queries

When specifying a prompt template with multiple variables to fill in, an LLM typically needs to be invoked once per variable. For instance, consider the following template that guides an LLM in generating a list of things:

argmax
    "A list of things not to forget when going to the sea (not travelling): \n"
    "- Sunglasses \n"
    "-[THING]"
    "-[THING]"
    "-[THING]"
    "-[THING]"
from
    'openai/text-ada-001'
where
    STOPS_AT(THING, "\n")

Without Caching: Tokens: 390, Requests: 4 | With Caching Layer: Tokens: 89 (-77%), Requests: 1 (-75%)

Here, the LLM typically needs to be invoked 4 times, once per [THING] variable. Each call incurs a token and latency cost (both with OpenAI and local models). Separate calls are needed because our template dictates that the - token be inserted before each [THING].

With the caching layer, LMQL can now invoke the LLM only once, and fill in all variables with the resulting tokens, as long as the LLM output already aligns naturally with your template. In case the LLM result of the initial invocation at some point no longer aligns with the template, LMQL will automatically re-invoke the LLM from this point on, guaranteeing an overall consistent result that is already parsed into separate [THING] variables.

Short-Circuiting Long Constraints

When you specify long constraints like A in ["ABCDE", "FGHIJK"], the LMQL runtime guides the LLM to choose one of the provided options and then continues enforcing the sequence until the chosen value is fully decoded. To illustrate, consider the following query:

argmax
    "If we have the choice we choose[OPTION]"
from 
    "openai/text-ada-001"
where
    OPTION in ["Option A with a whole lot of extra context", 
        "Option B with context", 
        "Another Option, also with a lot of additional text"
    ]
Example output, with the value generated for OPTION:

If we have the choice we choose Option A with a whole lot of extra context

Without Caching: Tokens: 123, Requests: 9 | With Caching Layer: Tokens: 25 (-80%), Requests: 2 (-78%)

Here, after the LLM has produced "Option" and then " A", LMQL short-circuits further model calls and automatically completes the resulting sequence to "Option A with a whole lot of extra context". This is possible because once Option A has been predicted, the remaining tokens are fully determined by the constraints.

Tool-Augmented Queries

Lastly, we consider tool-augmented queries. LLM agents and tool augmentation are very powerful paradigms that allow LLMs to incorporate external knowledge and reasoning into their predictions. However, this comes at a cost: on each tool invocation, the LLM needs to be re-invoked to continue decoding after the tool output has been inserted. This impacts both the token cost and latency of running queries, as many requests have to be sent back and forth between the LLM and the tool.

As an example, consider the following query that augments an LLM with the ability to use a key-value storage, also runnable in the browser-based LMQL Playground.

Key-Storage Augmented LLM implemented in LMQL

Without Caching: Tokens: 5,162, Requests: 12 | With Caching Layer: Tokens: 3,481 (-33%), Requests: 8 (-33%)

Here, whenever the LLM produces an action relating to our key-value storage, we invoke a tool that handles the storage and return the result (to assign and get stored values). The result of each tool invocation is then inserted into the LLM output, and the LLM is re-invoked to continue decoding.

We count 10 tool interactions, which results in 12 requests when running without caching. Using the new caching layer, however, we can reduce this to 8 requests, even undercutting the number of tool interactions. This is possible because the caching layer does not abort LLM generation if the LLM already correctly predicts the tool output.

This scenario demonstrates that the natural ability of LLMs to complete sequences can be leveraged to reduce the number of tool interactions, by relying on speculative execution.

Persisting the Cache

Of course, the in-memory cache of the LMQL runtime can also be persisted to disk, allowing you to reuse the cache tree across multiple queries, automatically reducing token cost and latency. In some cases this can even be used to reduce the number of requests to the LLM to 0, e.g. if the cache already contains the desired result.

To do so, you can simply specify a cache="file.tokens" parameter in your query code:

argmax(cache="joke.tokens")
   """A good dad joke. A indicates the punchline
   Q:[JOKE]
   A:[PUNCHLINE]"""
from
   "openai/text-davinci-003"
where
   len(JOKE) < 120 and 
   STOPS_AT(JOKE, "?") and 
   STOPS_AT(PUNCHLINE, "\n") and 
   len(PUNCHLINE) > 1

The first successful run of this query will persist the cache to joke.tokens. Subsequent runs will then automatically load the cache from disk, and only invoke the LLM if the cache does not contain a match. This also works for queries whose underlying LLM requests only partially overlap, as the tree-based cache data structure will automatically identify matching subtrees.

Caching During Query Development: Persisting the cache can be particularly useful during query development, as it allows you to reuse the cache across multiple runs of the same query. A persistent cache will reduce token cost and latency of your query, even if you slightly change the query between runs.

Caveats and Disabling the Cache

You can disable the caching layer by specifying cache=False in your query code. This will cause the LMQL runtime to always invoke the LLM, and never use the cache. This is useful for debugging purposes, or if you want to ensure that the LLM is always invoked.

Further, as the cache currently is implemented as an append-only data structure, it will grow indefinitely. This may be problematic for long-running applications, as the cache will eventually grow to relatively large sizes. In the future, we plan to implement simple strategies to limit the cache size, such as a least-recently-used eviction policy.

Conclusion

In this post, we introduced the new caching layer of the LMQL runtime, which allows you to reduce the token cost and latency of your queries by reusing previously generated LLM outputs. We demonstrated how the caching layer reduces the number of LLM invocations in a variety of scenarios, including template-based queries, long constraints with short-circuiting, and tool-augmented queries. We also showed how the cache can be persisted to disk, allowing you to reuse it across multiple queries.

To learn more about LMQL please also check out our documentation, or join our Discord to chat with us directly. We are looking forward to hearing from you!

LMQL Release 0.0.5

April 17, 2023

Today we are releasing version 0.0.5 of LMQL. This release focuses on stability and performance improvements. For a detailed list of changes, please see below. We are particularly excited about the first community contributions that have been merged as part of this release, with many more in the works.

lmql==0.0.5 has been published on PyPI, based on the current main branch of the GitHub repository. The updated version has also been deployed to the browser-based lmql.ai/playground.

Changelog

  • Decoder Performance The argmax and sample decoders have undergone some optimizations, allowing them to run faster. This results in a 20-30% speed-up on common query workloads. #24.

  • Postprocessing Semantics Internally, LMQL now allows constraints to implement postprocessing semantics. This is used to convert variable values after they have been completed, to a more normalized form in the prompt, and to a semantically meaningful data type in the context of the query code. #24.

    For example, when using an INT(<var>) constraint on a generated number, the model will be restricted to only generate valid integers, and now, the resulting NUM value will additionally be converted to an int value:

    argmax
       "My favorite number is: [NUM]\n"
       print(type(NUM), NUM * 2) # <class 'int'> 4
       "Number times two is {NUM * 2}"
    from
       'openai/text-ada-001'
    where
       INT(NUM) 
    
  • Core Interpreter A complete reimplementation of the LMQL core interpreter has been completed. This fixes a couple of minor issues and overall, improves reliability and performance when dealing with branching decoding algorithms. #24.

  • Playground Locally and when used in-browser, the LMQL Playground now streams debugger information from the LMQL interpreter incrementally. This leads to speed-ups when running in the Playground, especially with longer outputs. #27f9a8ad.

  • Other Fixes:

    • When used from within Python (as a decorated function), LMQL code no longer has to be doubly escaped, e.g. you can now write STOPS_AT(VAR, "\n") instead of STOPS_AT(VAR, "\\n").
    • The LMQL inference API buffers requests that come in during startup, to avoid errors when the server is not yet ready. #15, thanks to @chrispan.
    • OpenAI request parallelization no longer leads to an error regarding worker processes on Linux systems. #6.

Preview

Apart from the changes above, we are also working on a number of other features, including:

  • llama.cpp support as started in this PR, thanks to @CircArgs.

  • Support for Type Constraints, e.g. type(VAR) is DataClass, that automatically force the model to produce a value that structurally conforms to the given type. See this Twitter thread for more details.

  • Support for using ANTLR parsers during query execution, to force the model to produce a value that conforms to a given grammar.

  • Extending Logit Masking to OpenAI Chat Models. This will enable full support for LMQL constraints with e.g. chatgpt and gpt-4 models. See #25, thanks to @kharvd.