Agents are the trendiest topic in AI today, and with good reason. AI agents act on their users’ behalf, autonomously handling tasks like making online purchases, building software, researching business trends or booking travel. By taking generative AI out of the protected sandbox of the chat interface and allowing it to act directly on the world, agentic AI represents a leap forward in the power and utility of AI.
Agentic AI has been moving really fast: For example, one of the core building blocks of today’s agents, the model context protocol (MCP), is only a year old! As in any fast-moving field, there are many competing definitions, hot takes and misleading opinions.
To cut through the noise, I’d like to describe the core components of an agentic AI system and how they fit together: It’s really not as complicated as it may seem. Hopefully, when you’ve finished reading this post, agents won’t seem as mysterious.
Agentic ecosystem
Definitions of the word “agent” abound, but I like a slight variation on the British programmer Simon Willison’s minimalist take:
An LLM agent runs tools in a loop to achieve a goal.
The user prompts a large language model (LLM) with a goal: Say, booking a table at a restaurant near a specific theater. Along with the goal, the model receives a list of the tools at its disposal, such as a database of restaurant locations or a record of the user’s food preferences. The model then plans how to achieve the goal and calls one of its tools; the tool provides a response, and on the basis of that response, the model calls another tool. Through repeated iterations, the agent moves toward accomplishing the goal. In some cases, the model’s orchestration and planning choices are complemented or enhanced by imperative code.
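To make that concrete, here is a minimal sketch of the loop in Python. The `call_llm` function is a hypothetical stand-in for any LLM API, and the decision format is invented for illustration:

```python
# A minimal sketch of "runs tools in a loop to achieve a goal".
# call_llm() is a hypothetical stand-in for any LLM API, not a vendor's SDK.
def call_llm(history: list, tool_specs: list) -> dict:
    """Hypothetical: returns either {"done": True, "answer": ...} or
    {"done": False, "tool": name, "args": {...}}."""
    ...

def run_agent(goal: str, tools: dict, max_steps: int = 10):
    history = [goal]
    for _ in range(max_steps):
        decision = call_llm(history, tool_specs=list(tools))  # plan the next step
        if decision["done"]:                                  # goal achieved
            return decision["answer"]
        result = tools[decision["tool"]](**decision["args"])  # call a tool
        history.append(result)        # the tool's response informs the next step
```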
But what kind of infrastructure does it take to realize this approach? An agentic system needs a few core components:
- A way to build the agent. When you deploy an agent, you don’t want to have to code it from scratch. There are several agent development frameworks out there.
- Somewhere to run the AI model. A seasoned AI developer can download an open-weight LLM, but it takes expertise to do that right. It also takes expensive hardware that’s going to be poorly utilized for the average user.
- Somewhere to run the agentic code. With established frameworks, the user creates code for an agent object with a defined set of functions. Most of those functions involve sending prompts to an AI model, but the code needs to run somewhere. In practice, most agents will run in the cloud, because we want them to keep running when our laptops are closed, and we want them to scale up and out to do their work.
- A mechanism for translating between the text-based LLM and tool calls.
- A short-term memory for tracking the content of agentic interactions.
- A long-term memory for tracking the user’s preferences and affinities across sessions.
- A way to trace the system’s execution, to evaluate the agent’s performance.
Let’s dive into more detail on each of these components.
Building an agent
Asking an LLM to explain how it plans to approach a particular task improves its performance on that task. This “chain-of-thought reasoning” is now ubiquitous in AI.
The analogue in agentic systems is the ReAct (reasoning + action) model, in which the agent has a thought (“I’ll use the map function to locate nearby restaurants”), performs an action (issuing an API call to the map function), then makes an observation (“There are two pizza places and one Indian restaurant within two blocks of the movie theater”).
ReAct isn’t the only way to build agents, but it is at the core of most successful agentic systems. Today, agents are commonly loops over the thought-action-observation sequence.
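One turn of that cycle can be recorded as a simple thought-action-observation record. The structure below is illustrative; ReAct prescribes the pattern, not a particular schema:

```python
# One turn of a ReAct-style loop, recorded as thought / action / observation.
# The field values are invented; ReAct names the pattern, not a fixed schema.
step = {
    "thought": "I'll use the map tool to locate restaurants near the theater.",
    "action": {"tool": "find_restaurants", "args": {"near": "the theater"}},
    "observation": "Two pizza places and one Indian restaurant within two blocks.",
}
transcript = [step]  # the growing transcript is fed back to the LLM each turn
```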
The tools available to the agent can include local tools and remote tools such as databases, microservices and software as a service. A tool’s specification includes a natural-language explanation of how and when it’s used and the syntax of its API calls.
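Here’s what such a specification might look like, rendered as a Python dict. The exact field names vary from framework to framework; this just mirrors the common JSON Schema style:

```python
# Illustrative tool specification: a natural-language description of how and
# when to use the tool, plus a machine-readable schema for its arguments.
# Field names vary by framework; this mirrors the common JSON Schema style.
find_restaurants_spec = {
    "name": "find_restaurants",
    "description": (
        "Search a restaurant database. Use this when the user wants to eat "
        "near a given location. Returns name, cuisine, price and distance."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "near": {"type": "string", "description": "Address or landmark"},
            "max_distance_miles": {"type": "number"},
            "cuisine": {"type": "string"},
        },
        "required": ["near"],
    },
}
```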
The developer can also tell the agent to, essentially, build its own tools on the fly. Say that a tool retrieves a table stored as comma-separated text, and to fulfill its goal, the agent needs to sort the table.
Sorting a table by repeatedly sending it through an LLM and evaluating the results would be a colossal waste of resources — and it’s not even guaranteed to give the right result. Instead, the developer can simply instruct the agent to generate its own Python code when it encounters a simple but repetitive task. These snippets of code can run locally alongside the agent or in a dedicated secure code interpreter tool.
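For the comma-separated-table example, the throwaway snippet the agent generates might be as simple as this (the column names are, of course, hypothetical):

```python
# The kind of throwaway snippet an agent might generate and hand to a code
# interpreter: sort a table stored as comma-separated text by one column.
import csv
import io

def sort_table(csv_text: str, column: str) -> str:
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    rows.sort(key=lambda r: float(r[column]))      # numeric sort on one column
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()

table = "name,distance_miles\nSole Pizza,0.4\nTaj Palace,0.2\n"
print(sort_table(table, "distance_miles"))         # Taj Palace first
```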
Tool use divides responsibility between the LLM and the developer, and the split is adjustable. Once the tools available to the agent have been specified, the developer can simply instruct the agent to use them as needed. Or the developer can specify which tool to use for which types of data, and even which data items to use as arguments during function calls.
Similarly, the developer can simply tell the agent to generate Python code when necessary to automate repetitive tasks or, alternatively, tell it which algorithms to use for which data types and even provide pseudocode. The approach can vary from agent to agent.
Runtime
Historically, there were two main ways to isolate code running on shared servers: Containerization, which was efficient but offered lower security; and virtual machines, which were secure but came with a lot of computational overhead.
In 2018, Amazon Web Services’ (AWS’s) Lambda serverless-computing service deployed Firecracker, a new paradigm in server isolation. Firecracker creates “microVMs”, complete with hardware isolation and their own Linux kernels but with reduced overhead (as low as a few megabytes) and startup times (as low as a few milliseconds). The low overhead means that each function executed on a Lambda server can have its own microVM.
However, because instantiating an agent requires deploying an LLM, together with the memory resources to track the LLM’s inputs and outputs, the per-function isolation model is impractical. Instead, with session-based isolation, every session is assigned its own microVM. When the session finishes, the LLM’s state information is copied to long-term memory, and the microVM is destroyed. This ensures the secure and efficient deployment of whole fleets of agents.
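In pseudocode terms, the session lifecycle looks something like the sketch below. Every call in it is a hypothetical placeholder for whatever APIs the platform provides; it shows the lifecycle, not any real SDK:

```python
# Conceptual sketch of session-based isolation. All of these functions are
# hypothetical placeholders for platform APIs (e.g., a Firecracker-based service).
def create_microvm(): ...                      # hypothetical: boot a fresh microVM
def load_long_term_memory(user_id): ...        # hypothetical memory store
def save_long_term_memory(user_id, state): ...

def run_session(user_id: str, goal: str):
    vm = create_microvm()          # one microVM per session, not per function
    try:
        agent = vm.start_agent(load_long_term_memory(user_id))
        result = agent.run(goal)   # LLM state and tool calls stay inside the VM
        save_long_term_memory(user_id, agent.state)  # persist before teardown
        return result
    finally:
        vm.destroy()               # session state vanishes with the microVM
```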
Tool calls
Just as there are several existing development frameworks for agent creation, there are several existing standards for communication between agents and tools, the most popular of which — currently — is the model context protocol (MCP).
MCP establishes a one-to-one connection between the agent’s LLM and a dedicated MCP server that executes tool calls, and it also establishes a standard format for passing different types of data back and forth between the LLM and its server.
Many platforms use MCP by default, but are also configurable, so they will support a growing set of protocols over time.
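Under the hood, MCP messages are JSON-RPC 2.0. The method names “tools/list” and “tools/call” come from the MCP specification; the tool name and arguments below are made up for illustration:

```python
# The rough shape of MCP's JSON-RPC 2.0 messages for tool use. The methods
# "tools/list" and "tools/call" are from the MCP spec; the tool name and
# arguments are hypothetical.
list_request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/list",       # ask the MCP server what tools it offers
}

call_request = {
    "jsonrpc": "2.0",
    "id": 2,
    "method": "tools/call",       # invoke one tool with structured arguments
    "params": {
        "name": "find_restaurants",
        "arguments": {"near": "the theater", "cuisine": "pizza"},
    },
}
```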
Sometimes, however, the necessary tool is not one with an available API. In such cases, the only way to retrieve data or perform an action is through cursor movements and clicks on a website. There are a number of services available to perform such computer use. This makes any website a potential tool for agents, opening up decades of content and valuable services that aren’t yet available directly through APIs.
Authorizations
With agents, authorization works in two directions. First, of course, users require authorization to run the agents they’ve created. But as the agent is acting on the user’s behalf, it will usually require its own authorization to access networked resources.
There are a few different ways to approach the problem of authorization. One is with an access delegation algorithm like OAuth, which essentially plumbs the authorization process through the agentic system. The user enters login credentials through the OAuth flow, and the agentic system uses the resulting token to log into protected resources; the agentic system never has direct access to the user’s passwords.
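The practical upshot is that the agent holds a scoped token rather than a password. Here’s a minimal sketch, assuming the OAuth flow has already produced an access token and using a hypothetical resource URL:

```python
# Minimal sketch: the agent presents a delegated OAuth access token; it never
# sees the user's password. Assumes the OAuth flow has already run; the URL
# is a hypothetical example.
import requests

def call_protected_api(access_token: str) -> dict:
    resp = requests.get(
        "https://api.example.com/reservations",            # hypothetical resource
        headers={"Authorization": f"Bearer {access_token}"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()
```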
In the other approach, the user logs into a secure session on a server, and the server has its own login credentials on protected resources. Existing platforms let the user select from a variety of authorization strategies and from algorithms for implementing those strategies.
Memory and traces
Short-term memory
LLMs are next-word prediction engines. What makes them so astoundingly versatile is that their predictions are based on long sequences of words they’ve already seen, known as context. Context is, in itself, a kind of memory. But it’s not the only kind an agentic system needs.
Suppose, again, that an agent is trying to book a restaurant near a movie theater, and from a map tool, it’s retrieved a couple dozen restaurants within a mile radius. It doesn’t want to dump information about all those restaurants into the LLM’s context: All that extraneous information could wreak havoc with next-word probabilities.
Instead, it can store the complete list in short-term memory and retrieve one or two records at a time, based on, say, the user’s price and cuisine preferences and proximity to the theater. If none of those restaurants pans out, the agent can dip back into short-term memory, rather than having to execute another tool call.
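A minimal sketch of that pattern: the full tool result goes into short-term memory once, and only the most relevant records are pulled into the LLM’s context. The records and scoring function here are invented for illustration:

```python
# Sketch of short-term memory: store a big tool result once, then feed the
# LLM only the few records that matter right now. Data and scoring invented.
short_term_memory: dict[str, list[dict]] = {}

def remember(key: str, records: list[dict]) -> None:
    short_term_memory[key] = records

def recall(key: str, prefers, k: int = 2) -> list[dict]:
    """Return the top-k stored records by a caller-supplied preference score."""
    return sorted(short_term_memory[key], key=prefers, reverse=True)[:k]

# A couple dozen restaurants would come back from the map tool; three shown.
map_tool_results = [
    {"name": "Sole Pizza", "cuisine": "pizza", "distance_miles": 0.4},
    {"name": "Taj Palace", "cuisine": "indian", "distance_miles": 0.2},
    {"name": "Luigi's", "cuisine": "pizza", "distance_miles": 0.9},
]
remember("restaurants", map_tool_results)
# Only the two closest restaurants enter the context; the rest stay in
# short-term memory in case the agent needs to dip back in.
top_two = recall("restaurants", prefers=lambda r: -r["distance_miles"])
```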
Long-term memory
Agents also need to remember their prior interactions with their clients. If last week I told the restaurant booking agent what type of food I like, I don’t want to have to tell it again this week. The same goes for my price tolerance, the sort of ambiance I’m looking for, and so on.
Long-term memory allows the agent to look up what it needs to know about prior conversations with the user. Agents don’t typically create long-term memories themselves, however. Instead, after a session is complete, the whole conversation passes to a separate AI model, which creates new long-term memories or updates existing ones.
Memory creation can involve LLM summarization and “chunking”, in which documents are split into sections grouped according to topic for ease of retrieval during subsequent sessions. Available systems allow the user to select strategies and algorithms for summarization, chunking and other information-extraction techniques.
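Roughly, the post-session pipeline might look like the sketch below, where `summarize` and `split_by_topic` are hypothetical stand-ins for the LLM-based extraction steps a real platform would run:

```python
# Conceptual post-session memory pipeline. summarize() and split_by_topic()
# are hypothetical stand-ins for LLM-based extraction steps.
from dataclasses import dataclass

@dataclass
class Chunk:
    topic: str
    text: str

def summarize(transcript: str) -> str:
    """Hypothetical: an LLM call that distills the session."""
    ...

def split_by_topic(summary: str) -> list[Chunk]:
    """Hypothetical: group the summary into topic-based chunks."""
    ...

def update_long_term_memory(user_id: str, transcript: str, store) -> None:
    summary = summarize(transcript)        # e.g., "likes Indian food; $$ max"
    for chunk in split_by_topic(summary):  # chunking eases later retrieval
        store.upsert(user_id=user_id, topic=chunk.topic, text=chunk.text)
```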
Observability
Agents are a new kind of software system, and they require new ways to think about observing, monitoring and auditing their behavior. Some of the questions we ask will look familiar: Whether the agents are running fast enough, how much they’re costing, how many tool calls they’re making and whether users are happy. But new questions will arise, too, and we can’t necessarily predict what data we’ll need to answer them.
Observability and tracing tools can provide an end-to-end view of the execution of a session with an agent, breaking down step-by-step which actions were taken and why. For the agent builder, these traces are key to understanding how well agents are working — and provide the data to make them work better.
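As one sketch of what that instrumentation can look like, here is a single tool call wrapped in an OpenTelemetry span. OpenTelemetry is a real, widely used tracing library, but the span and attribute names here are ad hoc choices, not a standard:

```python
# One agent step instrumented with OpenTelemetry. The span and attribute
# names are ad hoc choices for illustration.
from opentelemetry import trace

tracer = trace.get_tracer("restaurant-agent")   # hypothetical service name

def traced_tool_call(tools: dict, tool_name: str, arguments: dict):
    # Wrap one tool call in a span so the trace records what ran and with what.
    with tracer.start_as_current_span("agent.tool_call") as span:
        span.set_attribute("tool.name", tool_name)
        span.set_attribute("tool.arguments", str(arguments))
        result = tools[tool_name](**arguments)
        span.set_attribute("tool.result.chars", len(str(result)))
        return result
```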
I hope this explanation has demystified agentic AI enough that you’re willing to try building your own agents!