One of the most fascinating aspects of autoregressive large language models (LLMs) like GPT-3 is their ability to act through external tools. In this post, I’ll illustrate how LMQL (Beurer-Kellner et al., 2022), a new programming language for language model interaction, helps better steer such LLM agents. I’ll take ReAct (Yao et al., 2022) as an example and show how to enforce constraints on the task-solving trajectory, the choice of tools and the tools’ inputs.
Taken in isolation, LLMs know at most what they have been shown during their training and can only generate text. However, external tools can give them access to additional knowledge sources and enable them to act in the real world. Such a tool can be a web search engine, a calculator, a Python interpreter, an email client…
Following the ReAct paper, we’ll imagine in the example below that we want to answer questions with the help of two tools:
- `search(entity)` returns the first sentences of the Wikipedia page of `entity` (or suggests the most similar entities if `entity` doesn’t have a Wikipedia page);
- `lookup(string)` returns the next sentence containing `string` in the last page accessed with the `search` tool.

Let’s now see how the ReAct prompting method works:
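The LLM is prompted, typically with a few worked examples, to alternate between free-text reasoning steps ("Thoughts"), tool calls ("Actions") and their results ("Observations") until it emits a Finish action containing the answer. Schematically, a trajectory looks like this (angle brackets denote placeholders, not literal text):

```text
Question: <a question requiring external knowledge>
Thought 1: <the model reasons about what to do first>
Action 1: Search[<entity>]
Observation 1: <first sentences of the Wikipedia page of <entity>>
Thought 2: <the model decides it needs a specific detail>
Action 2: Lookup[<string>]
Observation 2: <next sentence of the page containing <string>>
Thought 3: <the model concludes it knows the answer>
Action 3: Finish[<answer>]
```

The observations are produced by the tools, not by the LLM, and are appended to the prompt before the next Thought is generated.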
Although very simple and easy to implement, ReAct yields good results with large enough language models. However, the method relies on the LLM rigorously following the prescribed format. In practice, for example with LangChain’s implementation of ReAct, the LLM may deviate from this format and even hallucinate imaginary tools, which prevents its output from being parsed. And this is where LMQL comes into play…
LMQL is a declarative query language for LLMs. With LMQL queries, developers can concisely and intuitively specify the constraints applicable to a sequence of text generation steps. For example, an LMQL query along the lines of the sketch below implements the ReAct method.
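Here is a minimal sketch of what such a query can look like in the classic `argmax` / `from` / `where` LMQL syntax. The strings `prompt_start` (instructions and few-shot examples) and `question`, the dictionary `tools` mapping the action names to the Python functions described earlier, the cap on the number of steps and the model name are all assumptions of this sketch:

```lmql
argmax
    "{prompt_start}"
    "Question: {question}\n"
    # cap the number of Thought/Action/Observation steps
    for i in range(1, 8):
        "Thought {i}:[THOUGHT]\n"
        # [[ and ]] produce literal square brackets around the tool input
        "Action {i}: [ACTION][[[INPUT]]]\n"
        if ACTION == "Finish":
            # the argument of Finish is the final answer
            break
        # dispatch to the chosen tool and show its result to the LLM
        observation = tools[ACTION](INPUT)
        "Observation {i}: {observation}\n"
from
    "openai/text-davinci-003"
where
    STOPS_BEFORE(THOUGHT, "\n") and
    STOPS_BEFORE(INPUT, "]") and
    ACTION in ["Search", "Lookup", "Finish"]
```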
I invite you to learn about the LMQL syntax through the LMQL documentation and examples. In a nutshell:
- `argmax` denotes a greedy decoding strategy for the text generation;
- the block between `argmax` and `from` specifies the structure of the text to create. It includes some Python code responsible for the control flow and top-level strings that are progressively added to the text as they appear;
- a placeholder like `[X]` marks a hole in the prompt. At this location, the LLM will generate some text that will be stored in a variable named `X`;
- the `from` clause specifies which model to call;
- the `where` clause defines constraints on the variables filled in by the LLM. For example, the `STOPS_BEFORE(X, c)` constraint interrupts the text generation for variable `X` when character `c` is generated. `c` is discarded so that `X` does not end with `c`.

Assuming that you have previously defined the Python strings named `prompt_start` and `question`, as well as a Python dictionary mapping the strings “Search” and “Lookup” to the corresponding functions, this LMQL query is all you need to run ReAct.
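For instance, that dictionary could be set up along these lines (a sketch only: the actual Wikipedia access, HTTP calls and caching of the last fetched page are left out):

```python
def search(entity: str) -> str:
    """Return the first sentences of the Wikipedia page of `entity`,
    or suggestions of similar entities if the page does not exist."""
    raise NotImplementedError  # Wikipedia access omitted in this sketch

def lookup(string: str) -> str:
    """Return the next sentence containing `string` in the last page
    fetched with `search`."""
    raise NotImplementedError  # Wikipedia access omitted in this sketch

# Dictionary used by the query to dispatch the action chosen by the LLM.
tools = {"Search": search, "Lookup": lookup}
```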
The query is concise and easy to understand. Furthermore, as opposed to the LangChain ReAct implementation, it guarantees that the output text follows the expected format. For example, the LLM cannot invent new actions.
The tools of the ReAct agent may not be relevant or available at all times. For example:
- the `lookup` tool provides valid results only if the `search` tool has been used at least once.

In such cases, the restrictions can be specified with LMQL. For instance, the sketch below corresponds to the same ReAct agent as before, but with the `lookup` tool not available for the first action.
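One way to encode this restriction, sketched with the same placeholder names as before: the first action gets its own variable with a stricter membership constraint, so the model simply cannot pick `Lookup` at step 1.

```lmql
argmax
    "{prompt_start}"
    "Question: {question}\n"
    for i in range(1, 8):
        "Thought {i}:[THOUGHT]\n"
        if i == 1:
            # first step: a dedicated variable with a stricter constraint,
            # so Lookup is not available yet
            "Action {i}: [FIRST_ACTION][[[INPUT]]]\n"
            ACTION = FIRST_ACTION
        else:
            "Action {i}: [ACTION][[[INPUT]]]\n"
        if ACTION == "Finish":
            break
        observation = tools[ACTION](INPUT)
        "Observation {i}: {observation}\n"
from
    "openai/text-davinci-003"
where
    STOPS_BEFORE(THOUGHT, "\n") and STOPS_BEFORE(INPUT, "]") and
    FIRST_ACTION in ["Search", "Finish"] and
    ACTION in ["Search", "Lookup", "Finish"]
```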
A tool in LangChain is implemented through a string-to-string function that computes the “observation” provided to the LLM. However, in many cases, the input should not be an arbitrary string. This string may be expected to take only one of a few predetermined values or to be formatted in a very specific way, for example following a JSON schema.
Such constraints, which LangChain cannot guarantee, can be enforced with LMQL. For this, we can simply condition on the value of the `ACTION` variable, specify the structure of the input string, and add the right constraints on the corresponding variables.
In the example below, we imagine that we have a new tool `search-lookup` that combines `search` and `lookup` in one function and takes two arguments (the entity of the Wikipedia page and the word to look up). In this case, we of course need to adjust the instructions or the few-shot examples so that the LLM considers using the new tool.
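A sketch of what this could look like; `search_lookup` is a hypothetical Python function taking the entity and the keyword, and the other names are the same placeholders as before. Once the action is known, the shape of its input is spelled out in the prompt itself, with one constrained variable per argument:

```lmql
argmax
    "{prompt_start}"
    "Question: {question}\n"
    for i in range(1, 8):
        "Thought {i}:[THOUGHT]\n"
        "Action {i}: [ACTION]"
        if ACTION == "Finish":
            "[[[ANSWER]]]\n"
            break
        # Search-Lookup takes two structured arguments: the Wikipedia entity
        # and the word to look up on that page ([[ and ]] are literal brackets)
        "[[entity=[ENTITY], keyword=[KEYWORD]]]\n"
        observation = search_lookup(ENTITY, KEYWORD)
        "Observation {i}: {observation}\n"
from
    "openai/text-davinci-003"
where
    STOPS_BEFORE(THOUGHT, "\n") and
    ACTION in ["Search-Lookup", "Finish"] and
    STOPS_BEFORE(ENTITY, ",") and
    STOPS_BEFORE(KEYWORD, "]") and
    STOPS_BEFORE(ANSWER, "]")
```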
The last example covered in this blog post concerns the OpenAPI hierarchical planning agent implemented in LangChain.
This agent is designed to accomplish high-level tasks with multiple API calls in the context of many potential API endpoints. It combines a planner responsible for listing the endpoints to call in the right order and a controller responsible for calling these endpoints.
In the example in the LangChain documentation, the user query is “make me a playlist with the first song from kind of blue. call it machine blues” and the planner is provided with the Spotify OpenAPI specification.
In this situation, a planner based on `gpt-4` devised the correct plan below, while a planner based on `text-davinci-003` included invalid endpoints:
```
1. GET /search to search for the album "Kind of Blue"
2. GET /albums/{id}/tracks to get the tracks from the "Kind of Blue" album
3. GET /me to get the current user's information
4. POST /users/{user_id}/playlists to create a new playlist named "Machine Blues" for the current user
5. POST /playlists/{playlist_id}/tracks to add the first song from "Kind of Blue" to the "Machine Blues" playlist
```
With LMQL, it is of course simple to create a planner that only references valid endpoints and follows the expected format (although, in this case, that alone is not enough to build a successful plan…), as demonstrated in this notebook.
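For illustration, here is a heavily simplified sketch of such a planner: the plan length is fixed, only the endpoints appearing in the plan above are allowed (in practice the list would be extracted from the OpenAPI specification), and `planner_prompt` and `user_query` are placeholder strings.

```lmql
argmax
    "{planner_prompt}"
    "Query: {user_query}\n"
    "Plan:\n"
    # simplified: a fixed number of steps and a hard-coded endpoint list
    for i in range(1, 6):
        "{i}. [ENDPOINT] to[PURPOSE]\n"
from
    "openai/text-davinci-003"
where
    ENDPOINT in [
        "GET /search",
        "GET /albums/{id}/tracks",
        "GET /me",
        "POST /users/{user_id}/playlists",
        "POST /playlists/{playlist_id}/tracks",
    ] and
    STOPS_BEFORE(PURPOSE, "\n")
```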
Constraining the text generation creates a computational overhead. This overhead is negligible with locally hosted models and simple constraints such as those used in this blog post.
This is unfortunately not true when calling the OpenAI API. All other things being equal, enforcing constraints leads to additional API calls, and all the tokens involved, even those previously fed to or generated by the API, are billed. This increases the API costs, as shown in this notebook.
On the other hand, guaranteeing that the constraints are satisfied may also reduce costs. For example, it can help achieve comparable performance with shorter prompts or less expensive models. It can also prevent generating texts that are unparseable and thus useless. Finally, the LMQL team is currently working on a cache layer that will decrease the number of requests.
LMQL is a promising tool to easily develop more predictable LLM agents and potentially make them safer and more beneficial. However, with the way the OpenAI API is currently designed and billed, this increased robustness may come with higher inference costs. Hopefully, the LLM providers will adapt their offerings and new powerful open source models will be made available so that users can take full advantage of tools like LMQL.
Many thanks to Luca Beurer-Kellner for his feedback and more generally to him and his colleagues for building LMQL.