From monolithic to modular open LLM agents

A system of experts via prompt chaining

March 6, 2024


The goal of this article is to demonstrate how chaining simple, minimalistic zero-shot prompts can lead to useful agentic behavior, even with smaller, general-purpose LLMs. To reliably extract specific pieces of information generated by one module and use them as input for another module, this article makes heavy use of schema-guided generation.


LLM agents require a wide range of capabilities to function correctly. They must be able to decompose complex user instructions, plan actions, interact with their environment using tools, call tools with correct arguments, reason about observations and adjust planning if needed.

Instructing a model to behave like an agent is often done with a comprehensive, more or less monolithic prompt. Commercial LLMs like GPT-4 are more capable of understanding such complex prompts and have important features like function calling already built in. Open LLMs, especially smaller ones, still struggle to do so. An alternative is to fine-tune smaller open LLMs on agent behavior traces created by larger models. Common to both approaches, however, is that a single monolithic expert still provides all the diverse capabilities of an agent (Figure 1, left).

Inspired by prompt chaining, I wondered how far I could get by decomposing an LLM agent into a system of smaller, highly specialized modules, each using a simple zero-shot prompt covering only one of an agent’s many capabilities (Figure 1, right). Smaller LLMs often work more reliably if prompted with a simple and narrow instruction. To further improve the performance of a module, it can optionally be fine-tuned on its specific role in the agentic system, like function calling as shown later, independently of other modules.


Figure 1. Monolithic approach (left) vs modular approach, as used in this article (right). M: Mistral-7B-Instruct-v0.2, CL: CodeLlama-7B-instruct, NR: NexusRaven-V2-13B.


At the core of the system is a ReAct-style agent loop that uses a planning module to plan actions. An action is defined by a selected tool and a task description. Executing the selected tool results in an observation. The agent uses short-term memory for recording task-observation pairs (scratchpad) and conversational memory for recording interactions with the user.
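The loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the article's implementation: the names `Action`, `run_agent`, `scripted_plan` and the tool name `respond_to_user` are hypothetical, and the LLM-backed planner and tools are replaced by scripted stubs.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Short-term memory: (task, observation) pairs collected for one request.
Scratchpad = List[Tuple[str, str]]


@dataclass
class Action:
    tool_name: str  # tool selected by the planner
    task: str       # task description for the very next step


def run_agent(
    request: str,
    plan: Callable[[str, Scratchpad], Action],
    tools: Dict[str, Callable[[str, str, Scratchpad], str]],
    max_steps: int = 10,
) -> str:
    scratchpad: Scratchpad = []
    for _ in range(max_steps):
        action = plan(request, scratchpad)
        observation = tools[action.tool_name](request, action.task, scratchpad)
        if action.tool_name == "respond_to_user":
            return observation  # the final answer ends the loop
        scratchpad.append((action.task, observation))
    raise RuntimeError("no final answer within max_steps")


# Scripted stand-ins for the LLM-backed planner and tools (hypothetical).
def scripted_plan(request: str, scratchpad: Scratchpad) -> Action:
    if not scratchpad:
        return Action("calculate", "Calculate 155 - 55")
    return Action("respond_to_user", "Answer the request")


demo_tools = {
    "calculate": lambda request, task, scratchpad: str(155 - 55),
    "respond_to_user": lambda request, task, scratchpad: (
        f"The difference is {scratchpad[-1][1]}."
    ),
}

answer = run_agent("What is 155 minus 55?", scripted_plan, demo_tools)
```

The `max_steps` limit is an addition for the sketch; it keeps a planner that never selects the responding tool from looping forever.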


The planner reasons about the user request, summarizes relevant information from previous task-observation pairs, thinks about the next useful steps (CoT), generates a task description for the very next step and selects an appropriate tool based on its name and short description (sample prompt and completion). This information is then returned to the agent loop which executes the selected tool. The planner uses a zero-shot prompted Mistral-7B-Instruct-v0.2 model.
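To make the planner's output usable by the agent loop, its completion has to be parsed into fields. The sketch below assumes the completion is a JSON object; the field names in `PLAN_SCHEMA` and the helper `parse_plan` are hypothetical and only mirror the steps listed above (summary, thoughts, task, selected tool). How the schema is enforced during generation is not shown here.

```python
import json
from typing import Dict, List

# Hypothetical JSON schema for the planner's completion.
PLAN_SCHEMA = {
    "type": "object",
    "properties": {
        "context_information_summary": {"type": "string"},
        "thoughts": {"type": "string"},
        "task": {"type": "string"},
        "selected_tool": {"type": "string"},
    },
    "required": ["context_information_summary", "thoughts", "task", "selected_tool"],
}


def parse_plan(completion: str, tool_names: List[str]) -> Dict[str, str]:
    """Parse a planner completion and check it against the schema above."""
    plan = json.loads(completion)
    if not isinstance(plan, dict):
        raise ValueError("planner output must be a JSON object")
    for key in PLAN_SCHEMA["required"]:
        if not isinstance(plan.get(key), str):
            raise ValueError(f"missing or non-string field: {key}")
    if plan["selected_tool"] not in tool_names:
        raise ValueError(f"unknown tool: {plan['selected_tool']}")
    return plan


sample_completion = json.dumps({
    "context_information_summary": "Vittoria Ceretti is 25 years old.",
    "thoughts": "The age is known, so the power can be calculated next.",
    "task": "Calculate the value of 25 raised to the power of 0.7.",
    "selected_tool": "calculate",
})
plan = parse_plan(sample_completion, ["calculate", "search_internet"])
```

Validating the selected tool against the known tool names catches a hallucinated tool before the agent loop tries to execute it.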


Observations, i.e. tool execution results, can vary significantly in size and relevance. For example, a calculator may output a single number whereas a search engine may return large amounts of text, most of it irrelevant to the current task. A summarizer uses that observation to formulate a short, task-specific answer (sample prompt and completion). This makes it much easier for the planner to reason over past task-observation pairs. The summarizer uses a zero-shot prompted Mistral-7B-Instruct-v0.2 model.
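A summarizer prompt of this kind can be as small as the following sketch. The wording of the prompt, the function name `build_summary_prompt` and the `max_chars` limit are assumptions for illustration; the article's actual prompt is not reproduced here.

```python
def build_summary_prompt(task: str, observation: str, max_chars: int = 4000) -> str:
    """Build a zero-shot prompt asking for a short, task-specific answer."""
    # Clip very long tool output so the prompt stays within the context window.
    clipped = observation[:max_chars]
    return (
        "You are given a task and the raw output of a tool.\n"
        "Answer the task in one or two sentences, using only the tool output.\n\n"
        f"Task: {task}\n\n"
        f"Tool output:\n{clipped}\n\n"
        "Answer:"
    )


prompt = build_summary_prompt(
    task="Search the internet for information about Vittoria Ceretti's age.",
    observation="(long search result text)",
)
```

Because the prompt carries only one narrow instruction, it fits the article's premise that smaller models handle simple, focused prompts more reliably.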


Most modules of the agent are tools and the system can be extended with further tools by implementing a common tool interface. During execution, a tool gets access to the user request, the current task description and the agent’s scratchpad. It is up to the tool implementation to make use of all or only a subset of the provided information.
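A common tool interface along these lines could look as follows. This is a sketch under assumptions: the class names `Tool` and `CountStepsTool`, the attribute names and the `run` signature are hypothetical and only reflect the three inputs named above (user request, task description, scratchpad).

```python
from abc import ABC, abstractmethod
from typing import List, Tuple


class Tool(ABC):
    """Hypothetical base class that every tool implements."""

    name: str         # used by the planner for tool selection
    description: str  # short description shown to the planner

    @abstractmethod
    def run(self, request: str, task: str, scratchpad: List[Tuple[str, str]]) -> str:
        """Execute the tool and return an observation."""


class CountStepsTool(Tool):
    """Toy tool that only uses the scratchpad and ignores the other inputs."""

    name = "count_steps"
    description = "Reports how many steps have been recorded so far."

    def run(self, request: str, task: str, scratchpad: List[Tuple[str, str]]) -> str:
        return f"{len(scratchpad)} previous steps recorded"


observation = CountStepsTool().run(
    request="How many steps so far?",
    task="Count the recorded steps.",
    scratchpad=[("Search for X", "X is 42")],
)
```

The toy tool shows the point made above: an implementation is free to use only a subset of the information it receives.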

Function call

The function call tool wraps a user-defined function into a tool interface so that it can be selected by the planner. It binds information from the task description and previous observations to function parameters. The default implementation uses NexusRaven-V2-13B for that purpose, an LLM fine-tuned for function calling from natural language instructions (sample prompt and completion). An alternative implementation uses a zero-shot prompted CodeLlama-7B-instruct model to generate function call arguments from instructions.
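Once a model has produced a call such as `create_event(title=..., date=..., time=...)`, the tool still has to turn that text into an actual function call. The sketch below shows one way to do this without `eval`, by parsing the call with `ast` and binding the arguments with `inspect.signature`. The helpers `parse_call` and `execute_call` and the mocked `create_event` body are assumptions for illustration, not the article's code.

```python
import ast
import inspect
from typing import Any, Callable, Dict, List, Tuple


def create_event(title: str, date: str, time: str) -> str:
    """Mocked calendar function (hypothetical body)."""
    return f"Event {title} successfully added to calendar, date={date}, time={time}"


def parse_call(call: str) -> Tuple[str, List[Any], Dict[str, Any]]:
    """Parse 'fn(a, k=v)' into name, positional and keyword arguments."""
    node = ast.parse(call, mode="eval").body
    if not isinstance(node, ast.Call) or not isinstance(node.func, ast.Name):
        raise ValueError("expected a simple function call")
    # literal_eval accepts only literals, so arbitrary expressions are rejected.
    args = [ast.literal_eval(arg) for arg in node.args]
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    return node.func.id, args, kwargs


def execute_call(call: str, fn: Callable[..., Any]) -> Any:
    name, args, kwargs = parse_call(call)
    if name != fn.__name__:
        raise ValueError(f"model called {name}, expected {fn.__name__}")
    # bind() raises TypeError if the arguments do not match the signature.
    bound = inspect.signature(fn).bind(*args, **kwargs)
    return fn(*bound.args, **bound.kwargs)


call = "create_event(title='Martin\\'s birthday party', date='Dec. 17th, 2023', time='8pm')"
observation = execute_call(call, create_event)
```

Restricting arguments to literals means nested calls are rejected in this sketch; supporting them would require evaluating inner calls recursively.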


Calculate

The calculate tool generates and executes Python code from a mathematical task description and previous (numerical) observations. The current implementation supports calculations that result in a single number (sample prompt and completion). Python code running this calculation is generated with a CodeLlama-7B-instruct model. Code execution is not sandboxed; use this component at your own risk.
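Executing the generated code can be sketched as below, assuming the model is instructed to assign its answer to a variable named `result`. The function name `run_calculation` and that convention are assumptions for this example. Like the component described above, the sketch runs code without any sandboxing.

```python
from numbers import Real


def run_calculation(code: str) -> float:
    """Run generated Python code and return the number stored in `result`."""
    namespace: dict = {}
    # WARNING: exec runs arbitrary code and is NOT sandboxed.
    exec(code, namespace)
    result = namespace.get("result")
    if isinstance(result, bool) or not isinstance(result, Real):
        raise ValueError("generated code must assign a single number to `result`")
    return float(result)


power = run_calculation("result = 25 ** 0.7")
difference = run_calculation("result = 155 - 55")
```

Rejecting anything that is not a single real number mirrors the stated limitation that only calculations with a single numeric result are supported.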

Ask user

This tool is used when the planner needs further input from the user. It doesn’t use an LLM and simply returns the user’s answer to the agent (code).

Respond to user

A tool that generates a final answer to the original request using all previous task-observation pairs (code, sample prompt and completion). This tool uses a zero-shot prompted Mistral-7B-Instruct-v0.2 model for generating the answer.


All models used by the modular agent are running on a llama.cpp server. Instructions for serving these LLMs are available here.

from langchain_experimental.chat_models.llm_wrapper import Llama2Chat

from import MistralInstruct
from gba.llm import LlamaCppClient

# Proxy for 8-bit quantized Mistral-7B-Instruct-v0.2
mistral_instruct = MistralInstruct(
    llm=LlamaCppClient(url="http://localhost:8081/completion", temperature=-1),
)

# Proxy for 4-bit quantized CodeLlama-7B-Instruct
code_llama = Llama2Chat(
    llm=LlamaCppClient(url="http://localhost:8088/completion", temperature=-1),
)

# Proxy for 8-bit quantized NexusRaven-V2-13B
nexus_raven = LlamaCppClient(url="http://localhost:8089/completion", temperature=-1)

Custom functions used in this example are create_event for adding events to a calendar, search_internet for searching documents matching a query, and search_images for searching images matching a query. To keep this example simple and to avoid dependencies on external APIs, search_internet searches for documents in a local document store, while search_images and create_event are mocked. You can replace them with other implementations or add new functions as you like.

from gba.agent import Agent
from import SearchEngine
from import DocumentStore
from gba.summary import ResultSummarizer
from import *

from example_docs import DOCUMENTS
from example_funcs import create_event

store = DocumentStore(path=".chroma")
engine = SearchEngine(store=store)

if store.count() == 0:
    for i, document in enumerate(DOCUMENTS):
        store.add(identifier=str(i), document=document)

summarizer = ResultSummarizer(model=mistral_instruct)

tools = [
    CalculateTool(model=code_llama, summarizer=summarizer),
    FunctionCallTool(model=nexus_raven, fn=create_event),
    FunctionCallTool(model=nexus_raven, fn=engine.search_images),
    FunctionCallTool(model=nexus_raven, fn=engine.search_internet, summarizer=summarizer),
]

agent = Agent(model=mistral_instruct, tools=tools)

Multiple tool use

Request: "Who is Leonardo DiCaprio's current girlfriend and what is her age raised to the power of 0.7?"
Task: Search the internet for information about Leonardo DiCaprio's current girlfriend.
Call: search_internet(query='Leonardo DiCaprio\'s current girlfriend')
Observation: Leonardo DiCaprio's current girlfriend is Italian model Vittoria Ceretti.

Task: Search the internet for information about Vittoria Ceretti's age.
Call: search_internet(query='Vittoria Ceretti')
Observation: Vittoria Ceretti is 25 years old.

Task: Calculate the value of 25 raised to the power of 0.7.
result = 25 ** 0.7
Observation: The value of 25 raised to the power of 0.7 is approximately 9.51827.

'The current girlfriend of Leonardo DiCaprio is Vittoria Ceretti, and the value of her age (which is 25) raised to the power of 0.7 is approximately 9.51827.'

Request: "How many years did octopuses appear earlier on Earth, compared to elephants?"
Task: Search the internet for the first appearance dates of octopuses and elephants on Earth.
Call: search_internet(query='first appearance dates of octopuses and elephants on Earth')
Observation: Octopuses first appeared around 155 million years ago during the Cretaceous period.

Task: Search the internet for the first appearance date of elephants on Earth.
Call: search_internet(query='first appearance date of elephants on Earth')
Observation: Elephants first appeared on Earth around 55 million years ago.

Task: Calculate the difference in years between the first appearance of octopuses (around 155 million years ago) and the first appearance of elephants (around 55 million years ago).
result = 155 - 55
Observation: The difference in years between the first appearance of octopuses (around 155 million years ago) and the first appearance of elephants (around 55 million years ago) is approximately 100 million years.

'Octopuses first appeared around 100 million years earlier than elephants.'

Single tool use

More precisely, the agent uses a single tool in addition to the respond to user tool.

Request: "Add Martin's birthday party, starting Dec. 17th 2023 8pm, to my calendar and respond with a one-line invitation"
Task: Creating Martin's birthday party event on Dec. 17th, 2023 at 8pm and generating a one-line invitation.
Call: create_event(title='Martin\'s birthday party', date='Dec. 17th, 2023', time='8pm')
Observation: Event Martin's birthday party successfully added to calendar, date=Dec. 17th, 2023, time=8pm

"You're invited to Martin's birthday party on Dec. 17th, 2023 at 8pm."

Respond directly

Request: "Tell me a joke"
"Why don't scientists trust atoms? Because they make up everything!"

Request user feedback

The agent asks the user to provide additional input required to complete the request.

Request: "Tell me a joke about my best friend"
Could you please provide some details about your best friend, such as their name or any quirks they have?:  Michael
Observation: Michael

Could you please tell me any funny or unique traits that your best friend Michael has?:  works at home
Observation: works at home

Task: Searching the internet for a joke about someone who works from home.
Call: search_internet(query='joke about someone who works from home')
Observation: I'm sorry, I couldn't find a joke about someone who works from home in the provided text.

Could you please provide any additional details about your best friend Michael that might help me create a joke?:  eats chickens
Observation: eats chickens

'Why did Michael the chicken farmer laugh while working from home? Because he was raising a good cluck of business!'

Conversational tool use

The agent stores conversational state in a memory module that is separate from the agent’s scratchpad. Conversational memory must be explicitly enabled with conversational=True.

conversational_agent = Agent(model=mistral_instruct, tools=tools, conversational=True)

Request: "I want an image of a dog"
Task: Searching for an image of a dog...
Call: search_images(query='dog')
Observation: [dog_1.jpg](

'Here is the image you requested: [dog_1.jpg]('

Request: "It should be brown"
Task: Searching for an image of a brown dog.
Call: search_images(query='brown dog')
Observation: [brown_dog_1.jpg](

'Here is an image of a brown dog: [brown_dog_1.jpg]('

Request: "Find an image with two of them"
Task: Search for an image of two brown dogs.
Call: search_images(query='two brown dogs')
Observation: [two_brown_dogs_1.jpg](

'Here is an image of two brown dogs: [two_brown_dogs_1.jpg]('
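The separation between the two kinds of memory can be sketched as follows. The class `AgentMemory` and its method names are hypothetical; the sketch only illustrates that the scratchpad is reset for every request, while conversational memory, when enabled, persists across requests so that follow-ups like "It should be brown" can be resolved.

```python
from typing import List, Tuple


class AgentMemory:
    """Hypothetical container for the agent's two kinds of memory."""

    def __init__(self, conversational: bool = False):
        self.conversational = conversational
        self.conversation: List[Tuple[str, str]] = []  # (request, response) pairs
        self.scratchpad: List[Tuple[str, str]] = []    # (task, observation) pairs

    def start_request(self) -> None:
        # Short-term memory is scoped to a single request.
        self.scratchpad = []

    def record_step(self, task: str, observation: str) -> None:
        self.scratchpad.append((task, observation))

    def finish_request(self, request: str, response: str) -> None:
        # Only a conversational agent remembers earlier interactions.
        if self.conversational:
            self.conversation.append((request, response))


memory = AgentMemory(conversational=True)
memory.start_request()
memory.record_step("Search for an image of a dog", "dog_1.jpg")
memory.finish_request("I want an image of a dog", "Here is the image: dog_1.jpg")
memory.start_request()  # second request: scratchpad is empty, conversation is kept
```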


The current implementation is very simple and minimalistic. Potential improvements may include but are not limited to:

  • Better decision making via tree search like in DFSDT or LATS, for example.
  • Self-improvement via self-reflection or action learning, for example.
  • Function or API selection from a large database instead of enumerating them in the planner prompt.
  • Better code LLM for the calculator tool to support more complex instructions and result types.
  • Leverage nested and parallel function calls with NexusRaven-V2.

Improvements can be implemented via prompt engineering and/or fine-tuning. A later version may also implement the agent modules as LangChain-compatible chains and tools, and the agent itself as a LangChain agent, but in the current experimentation phase I prefer the flexibility of a custom implementation.


Although it’s almost impossible to elicit agent behavior from a 7B open LLM with a monolithic zero-shot prompt, it is possible to achieve that via prompt chaining, which distributes the many responsibilities of an LLM agent across specialized modules. These modules are coordinated by a central planner and can be optimized individually, if needed. Schema-guided generation is an important factor in supporting reliable data exchange between these modules.
