Building AI agents is no longer a fringe activity reserved for research labs. Enterprises, scale-ups, and even small B2B teams are deploying agents to automate sales workflows, handle customer queries, process documents, and coordinate complex multi-step tasks. But before you can run agents, you need to build them — and that means choosing the right tools.
The landscape is crowded. Between frameworks, orchestration layers, memory stores, and deployment platforms, there are dozens of decisions to make before you write a single line of business logic. Get it wrong and you're locked into something that can't scale. Get it right and you have an automation backbone that compounds in value over time.
This guide breaks down the major components of a modern agent stack and compares the leading options at each layer. Whether you're evaluating your first deployment or auditing an existing setup, this is your map.
What Is an "Agent Stack"?
Think of an agent stack the same way you'd think about a web application stack — it's the collection of technologies that work together to make your agents function. A typical stack includes:
- LLM Provider — the underlying language model(s) powering reasoning and generation
- Agent Framework — the code layer that defines agent behaviour, tools, and decision logic
- Orchestration Layer — how multiple agents are coordinated and how tasks are routed
- Memory & State — how agents retain context, both short-term (within a session) and long-term (across sessions)
- Tool Integrations — the APIs, databases, and services agents can call
- Deployment & Monitoring — where agents run and how you observe them
Each layer has its own set of options. The right choices depend on your team's engineering maturity, the complexity of your use cases, and how fast you need to move.
Layer 1: LLM Providers
The model is the engine. Everything else in the stack exists to direct its intelligence.
OpenAI (GPT-4o, o1, o3)
The default choice for most teams. Strong reasoning, excellent tool-calling support, wide library compatibility, and the most mature API. GPT-4o is a strong all-rounder; the o-series models shine on complex multi-step reasoning tasks. Cost can add up at scale, and vendor lock-in is a real concern.
Best for: Teams that want to move fast and don't want to manage model infrastructure.
Anthropic (Claude Sonnet, Haiku, Opus)
Claude models are particularly strong at instruction-following, long-context tasks, and safety-critical workflows. Sonnet hits a good quality-to-cost ratio for production workloads. Anthropic's Constitutional AI training approach is intended to make Claude's behaviour more predictable, which matters in enterprise contexts.
Best for: Document-heavy workflows, compliance-sensitive applications, long-context reasoning.
Google (Gemini 2.0/2.5)
Gemini Pro and Flash offer strong multimodal capabilities and tight integration with Google Workspace tools. Gemini 2.5 Flash is competitive on price/performance. Native integration with Google Cloud makes deployment straightforward for teams already in that ecosystem.
Best for: Enterprises in the Google ecosystem, multimodal use cases, high-volume low-cost tasks.
Open-Source (Llama 3, Mistral, Qwen)
Models you can run on your own infrastructure. The obvious advantage is data privacy — nothing leaves your environment. Llama 3.3 70B is surprisingly capable for a self-hosted option. The trade-off is engineering overhead: you need to manage hosting, scaling, and model updates.
Best for: Regulated industries, privacy-first deployments, teams with ML infrastructure experience.
The practical approach: most production stacks use a mix of a high-capability model for complex reasoning tasks, a fast, cheap model for high-volume simple tasks, and possibly an open-source model for any data that can't leave the building.
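A mixed-model setup usually comes down to a small routing function. The sketch below is one way to express it, assuming three hypothetical tiers; the names in `MODEL_TIERS`, the `Task` fields, and the rule that privacy outranks capability are illustrative assumptions, not recommendations for specific models.

```python
from dataclasses import dataclass

# Hypothetical tier names; map them to whichever real models you deploy.
MODEL_TIERS = {
    "reasoning": "high-capability-model",
    "bulk": "fast-cheap-model",
    "private": "self-hosted-model",
}


@dataclass(frozen=True)
class Task:
    prompt: str
    contains_sensitive_data: bool = False
    needs_multistep_reasoning: bool = False


def pick_model(task: Task) -> str:
    """Return the model tier for a task; privacy wins over capability."""
    if task.contains_sensitive_data:
        return MODEL_TIERS["private"]
    if task.needs_multistep_reasoning:
        return MODEL_TIERS["reasoning"]
    return MODEL_TIERS["bulk"]
```

Keeping the rule in one function means the routing policy can be reviewed and tested separately from any agent logic.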
Layer 2: Agent Frameworks
This is where you define how your agents think, act, and use tools.
LangChain / LangGraph
LangChain is the most widely adopted agent framework, largely because it was first to market with a comprehensive abstraction layer. LangGraph, its newer sibling, introduces a graph-based execution model that gives you fine-grained control over agent flow — cycles, conditional routing, and state persistence are all first-class citizens.
Strengths: Mature ecosystem, extensive integrations, large community, good documentation.
Weaknesses: Abstraction overhead can make debugging painful. Early versions were notorious for leaky abstractions. LangGraph adds complexity that smaller teams may not need.
Best for: Teams who want a battle-tested framework with a large support ecosystem.
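To make the graph-based execution model concrete, here is a minimal plain-Python sketch of nodes, a conditional edge, a cycle, and shared state. It deliberately does not use the LangGraph API; `run_graph`, `NODES`, `EDGES`, and the `draft`/`review` nodes are hypothetical names chosen for illustration.

```python
from typing import Callable, Dict

State = Dict[str, object]
Node = Callable[[State], State]

END = "END"


def draft(state: State) -> State:
    # Stand-in for an LLM call that produces or revises a draft.
    attempts = int(state.get("attempts", 0)) + 1
    return {**state, "attempts": attempts, "draft": f"draft v{attempts}"}


def review(state: State) -> State:
    # Stand-in for a quality check; here the second attempt is accepted.
    return {**state, "approved": int(state["attempts"]) >= 2}


def route_after_review(state: State) -> str:
    # Conditional edge: loop back to draft until the review passes.
    return END if state["approved"] else "draft"


NODES: Dict[str, Node] = {"draft": draft, "review": review}
EDGES: Dict[str, Callable[[State], str]] = {
    "draft": lambda state: "review",
    "review": route_after_review,
}


def run_graph(start: str, state: State, max_steps: int = 10) -> State:
    """Walk the graph until END, with a step cap to stop runaway cycles."""
    current = start
    for _ in range(max_steps):
        state = NODES[current](state)
        current = EDGES[current](state)
        if current == END:
            return state
    raise RuntimeError("graph did not reach END within max_steps")
```

Calling `run_graph("draft", {})` runs draft, review, draft, review, and then stops. A real framework adds persistence and tracing on top of this loop, but the control flow is the same idea.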
AutoGen (Microsoft)
AutoGen takes a conversational multi-agent approach. Agents communicate with each other through structured dialogue — you define roles (a coder, a reviewer, a planner) and they negotiate to complete tasks. It's particularly strong for tasks that benefit from agent debate and iterative refinement.
Strengths: Strong multi-agent coordination, good for code generation and review loops, active Microsoft support.
Weaknesses: The conversational model can be verbose and expensive if not carefully constrained. Less flexible for non-dialogue workflows.
Best for: Development automation, code review pipelines, research tasks that benefit from multiple perspectives.
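The conversational pattern, and the need to constrain it, can be sketched without any framework. The example below is not AutoGen code: `coder`, `reviewer`, and `converse` are hypothetical stand-ins for LLM-backed agents, and the `APPROVE` keyword and turn cap are assumed conventions for ending the dialogue.

```python
from typing import Callable, List, Tuple

Message = Tuple[str, str]  # (speaker, text)
Agent = Callable[[List[Message]], str]


def coder(history: List[Message]) -> str:
    # Stand-in for an LLM-backed coder; it revises once it has feedback.
    reviewed = any(speaker == "reviewer" for speaker, _ in history)
    if reviewed:
        return "def add(a, b): return a + b"
    return "def add(a, b): return a - b"


def reviewer(history: List[Message]) -> str:
    # Stand-in for an LLM-backed reviewer; it approves the corrected code.
    last_code = history[-1][1]
    return "APPROVE" if "a + b" in last_code else "Bug: add() subtracts."


def converse(max_turns: int = 6) -> List[Message]:
    """Alternate coder and reviewer until approval or the turn cap."""
    history: List[Message] = []
    speakers = [("coder", coder), ("reviewer", reviewer)]
    for turn in range(max_turns):
        name, agent = speakers[turn % 2]
        reply = agent(history)
        history.append((name, reply))
        if reply == "APPROVE":
            break
    return history
```

The `max_turns` cap is what keeps the cost bounded: every extra turn is another model call, so an unconstrained debate gets expensive quickly.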
CrewAI
CrewAI presents a role-based agent model — you define a crew of agents, each with a specific role, goal, and backstory, then assign them tasks. The abstraction maps well to how humans think about team coordination. It's quick to prototype but can become rigid at scale.
Strengths: Intuitive mental model, fast to prototype, good for structured workflows with clear role separation.
Weaknesses: Less control over low-level execution, can feel like a black box, limited flexibility for complex branching logic.
Best for: Prototype-phase projects, teams with limited ML engineering expertise.
Semantic Kernel (Microsoft)
Semantic Kernel is an enterprise-grade SDK for integrating LLMs into existing .NET, Java, or Python applications. It's less "build an agent from scratch" and more "add AI capabilities to your existing software." The plugin model maps well to enterprise integration patterns.
Strengths: Enterprise-friendly, strong .NET support, good integration with Microsoft 365 and Azure.
Weaknesses: More opinionated architecture, steeper learning curve, less suited to pure Python shops.
Best for: Enterprises with existing Microsoft infrastructure, teams building AI features into enterprise software.
Pydantic AI
A newer entrant that leans into Python type safety. Agents are defined with strict input/output schemas using Pydantic models, making the integration between LLM outputs and application code much cleaner. Less batteries-included than LangChain but significantly less magic.
Strengths: Type safety, clean integration with Python applications, minimal abstraction overhead.
Weaknesses: Newer ecosystem, fewer integrations, requires more manual wiring.
Best for: Python-native teams who prioritise code quality and predictable interfaces over rapid scaffolding.
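The core idea, rejecting model output that doesn't match a declared schema before it reaches application code, can be shown with the standard library alone. This sketch uses a dataclass and manual checks as a stand-in for Pydantic models and does not show the Pydantic AI API; `TicketTriage`, `ALLOWED_CATEGORIES`, and `parse_triage` are hypothetical names.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class TicketTriage:
    # Hypothetical output schema for a support-triage agent.
    category: str
    priority: int
    summary: str


ALLOWED_CATEGORIES = {"billing", "bug", "question"}


def parse_triage(raw_llm_output: str) -> TicketTriage:
    """Parse and validate model output; raise ValueError on any mismatch."""
    try:
        data = json.loads(raw_llm_output)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    expected = {"category": str, "priority": int, "summary": str}
    if set(data) != set(expected):
        raise ValueError(f"unexpected fields: {sorted(data)}")
    for field, field_type in expected.items():
        value = data[field]
        # bool is a subclass of int, so reject booleans explicitly.
        if isinstance(value, bool) or not isinstance(value, field_type):
            raise ValueError(f"{field} must be {field_type.__name__}")
    if data["category"] not in ALLOWED_CATEGORIES:
        raise ValueError(f"unknown category: {data['category']}")
    if not 1 <= data["priority"] <= 5:
        raise ValueError("priority must be between 1 and 5")
    return TicketTriage(**data)
```

A schema library removes most of this boilerplate, but the contract is the same: downstream code only ever sees a validated `TicketTriage`, never raw model text.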
Layer 3: Orchestration Platforms
As you move from single agents to multi-agent systems, you need something to coordinate them at scale.
OpenAI Assistants API
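OpenAI's managed infrastructure for building agents. It includes built-in tools (code interpreter, file search), function calling, and thread management (persistent conversation state). The trade-off is vendor lock-in and less control over execution. OpenAI has also announced that the Assistants API will be retired in favour of its newer Responses API, so check the current migration guidance before committing to it.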
Best for: Teams who want a managed, opinionated solution without managing infrastructure.
LangSmith + LangGraph Cloud
LangSmith is LangChain's observability platform; LangGraph Cloud is its managed deployment layer. Together they provide a full lifecycle environment: build in LangGraph, observe in LangSmith, deploy to cloud. Strong developer experience for teams already on the LangChain stack.
Best for: Teams building with LangGraph who want integrated observability and deployment.
AWS Bedrock Agents
Amazon's managed multi-agent framework sits on top of Bedrock's model access layer. Native integration with S3, Lambda, DynamoDB, and the broader AWS ecosystem. Supports agent-to-agent handoffs and built-in knowledge base retrieval.
Best for: Enterprises already on AWS who want managed agents without building orchestration from scratch.
Azure AI Foundry (formerly AI Studio)
Microsoft's platform for building, evaluating, and deploying AI agents at enterprise scale. Tight integration with Azure OpenAI Service, Semantic Kernel, and the Microsoft 365 Copilot ecosystem. Strong governance and compliance tooling.
Best for: Enterprises in the Microsoft/Azure ecosystem building production-grade agent workflows.
Custom / Self-Hosted
For teams with specific requirements — full data control, custom routing logic, proprietary models — building your own orchestration layer remains a valid option. Tools like Ray, Celery, or custom Kubernetes-based setups give you maximum control at the cost of significant engineering investment.
Best for: Regulated industries, enterprises with complex existing infrastructure, teams with dedicated platform engineering capacity.
Layer 4: Memory & State
Agents without memory are stateless tools. Agents with memory become increasingly useful over time.
| Memory Type | What It Is | Example Tools |
|---|---|---|
| In-context | Information passed directly in the prompt (within token limit) | All LLMs natively |
| External Short-term | Session or conversation state stored outside the model | Redis, DynamoDB |
| Long-term Semantic | Embeddings stored in a vector DB, retrieved by relevance | Pinecone, Weaviate, pgvector |
| Structured State | Relational data the agent can query and update | PostgreSQL, MySQL |
| Knowledge Graph | Entity and relationship-aware retrieval | Neo4j, Cognee |
For most B2B applications, a combination of Redis (fast session state) and a vector store (semantic long-term memory) covers the majority of use cases. Adding a relational database gives you transactional state for workflow-critical data.
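The split between fast session state and semantic long-term memory can be sketched with two small in-memory classes. These are stand-ins for a Redis-style store and a vector database, not client code for either; `SessionStore`, `SemanticMemory`, and the two-dimensional embeddings in the usage note are hypothetical.

```python
import math
import time
from typing import Dict, List, Optional, Tuple


class SessionStore:
    """In-memory stand-in for a key-value store with per-key expiry."""

    def __init__(self) -> None:
        self._data: Dict[str, Tuple[float, str]] = {}

    def set(self, key: str, value: str, ttl_seconds: float,
            now: Optional[float] = None) -> None:
        now = time.time() if now is None else now
        self._data[key] = (now + ttl_seconds, value)

    def get(self, key: str, now: Optional[float] = None) -> Optional[str]:
        now = time.time() if now is None else now
        entry = self._data.get(key)
        if entry is None or entry[0] <= now:
            self._data.pop(key, None)  # expire lazily on read
            return None
        return entry[1]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticMemory:
    """In-memory stand-in for a vector store; the caller supplies embeddings."""

    def __init__(self) -> None:
        self._items: List[Tuple[List[float], str]] = []

    def add(self, embedding: List[float], text: str) -> None:
        self._items.append((embedding, text))

    def search(self, query: List[float], k: int = 1) -> List[str]:
        ranked = sorted(self._items,
                        key=lambda item: cosine(query, item[0]),
                        reverse=True)
        return [text for _, text in ranked[:k]]
```

For example, after `mem.add([1.0, 0.0], "pricing")` and `mem.add([0.0, 1.0], "onboarding")`, the query `[0.9, 0.1]` ranks "pricing" first. Session entries disappear once their TTL passes, which is the behaviour you want for conversation state but not for long-term knowledge.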
Layer 5: Tool Integrations
An agent's power is proportional to the quality of its tools. Common integration categories:
- Web search — Brave Search, Tavily, Bing API
- Code execution — E2B, Daytona, Docker-based sandboxes
- Database access — direct SQL, ORM wrappers, natural language to SQL
- Browser automation — Playwright, Browserbase, Stagehand
- Document processing — LlamaParse, Unstructured, Docling
- API connectors — HubSpot, Salesforce, Slack, Jira, Linear
- File storage — S3, Google Drive, SharePoint
The key design principle: tools should be narrowly scoped with clean interfaces. A tool that does one thing well is safer, more reliable, and easier to debug than a Swiss Army knife tool with unpredictable behaviour.
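As an illustration of a narrowly scoped tool, the sketch below exposes a single read-only lookup and checks every model-requested call against a declared parameter list. The order data, `get_order_status`, `Tool`, and `call_tool` are all hypothetical; real frameworks have their own tool-registration formats.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical read-only data; a real tool would query your order system.
_ORDERS = {"A100": "shipped", "A101": "processing"}


def get_order_status(order_id: str) -> str:
    """Look up one order's status. Does nothing else."""
    if order_id not in _ORDERS:
        raise KeyError(f"unknown order: {order_id}")
    return _ORDERS[order_id]


@dataclass(frozen=True)
class Tool:
    name: str
    description: str
    parameters: Dict[str, type]  # argument name -> required type
    func: Callable[..., str]


TOOLS = {
    "get_order_status": Tool(
        name="get_order_status",
        description="Return the status of a single order by ID.",
        parameters={"order_id": str},
        func=get_order_status,
    )
}


def call_tool(name: str, arguments: Dict[str, object]) -> str:
    """Dispatch a model-requested call after checking name and arguments."""
    tool = TOOLS.get(name)
    if tool is None:
        raise ValueError(f"unknown tool: {name}")
    if set(arguments) != set(tool.parameters):
        raise ValueError(f"expected arguments {sorted(tool.parameters)}")
    for arg, expected_type in tool.parameters.items():
        if not isinstance(arguments[arg], expected_type):
            raise ValueError(f"{arg} must be {expected_type.__name__}")
    return tool.func(**arguments)
```

Compare this with a generic "run any SQL" tool: the narrow version cannot modify data or read unrelated tables, and a failure points at one small function.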
Layer 6: Deployment & Observability
You can't run agents in production without knowing what they're doing.
Observability tools:
- LangSmith — tight LangChain integration, good trace visualisation
- Langfuse — open-source, model-agnostic, strong cost tracking
- Arize Phoenix — strong for ML teams, good evaluation tooling
- OpenTelemetry — standard-compliant, plugs into existing monitoring stacks
Deployment considerations:
- Agents typically run as async tasks, so put work on a message queue or task runner (SQS, RabbitMQ, Celery) rather than behind synchronous HTTP
- Plan for retry logic — LLM calls fail, tools time out, networks drop
- Rate limiting and cost caps are non-negotiable in production
- Human-in-the-loop approval gates are essential for high-stakes actions
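Three of the considerations above (retries, cost caps, and approval gates) fit in a few small helpers. This is a minimal sketch under assumptions: the exception types retried, the backoff parameters, the cost units, and the set of high-stakes action names are illustrative choices, not a prescribed policy.

```python
import random
import time
from typing import Callable, FrozenSet, TypeVar

T = TypeVar("T")


class BudgetExceeded(RuntimeError):
    pass


class CostCap:
    """Track spend against a hard limit (units are whatever you bill in)."""

    def __init__(self, limit: float) -> None:
        self.limit = limit
        self.spent = 0.0

    def charge(self, amount: float) -> None:
        # Refuse the charge before spending, so the cap is never exceeded.
        if self.spent + amount > self.limit:
            raise BudgetExceeded(f"cap {self.limit} would be exceeded")
        self.spent += amount


def with_retries(
    call: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry a flaky call with exponential backoff plus random jitter."""
    for attempt in range(attempts):
        try:
            return call()
        except (TimeoutError, ConnectionError):
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
    raise ValueError("attempts must be at least 1")


# Hypothetical action names; define the set from your own risk review.
HIGH_STAKES: FrozenSet[str] = frozenset({"refund", "delete"})


def requires_approval(action: str) -> bool:
    """Return True when a human should approve before the agent acts."""
    return action in HIGH_STAKES
```

The `sleep` parameter is injectable so the backoff can be tested without waiting. In production you would call `cap.charge(...)` before each model call and route any action where `requires_approval` is true to a review queue.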
Choosing Your Stack: A Decision Framework
There's no universal right answer. Here's how to think through your choices:
1. Start with your constraints. Data sovereignty requirements, existing cloud provider, team language/framework preferences — these eliminate options faster than any feature comparison.
2. Match complexity to maturity. CrewAI or OpenAI Assistants for early-stage exploration. LangGraph or Pydantic AI as you need more control. Custom orchestration only when you've outgrown managed solutions.
3. Optimise for observability from day one. Whatever framework you choose, instrument it early. Debugging agents without traces is close to impossible.
4. Plan for model swaps. The model landscape is moving fast. Avoid deep coupling to any single provider's API. Abstraction layers (LangChain, LiteLLM) make model swaps less painful.
5. Validate tools in isolation before composing them. Test every tool your agents will use independently before wiring them together. Tool failures cascade in unpredictable ways.
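Point 4 is easier to follow when the boundary is explicit in code. The sketch below assumes a single hypothetical `ChatModel` interface that business logic depends on; `EchoModel` and `UppercaseModel` are toy stand-ins for provider adapters, not real SDK wrappers.

```python
from typing import Protocol


class ChatModel(Protocol):
    """The only interface business logic is allowed to depend on."""

    def complete(self, prompt: str) -> str: ...


class EchoModel:
    # Hypothetical stand-in; a real adapter would wrap one provider's SDK.
    def complete(self, prompt: str) -> str:
        return f"echo: {prompt}"


class UppercaseModel:
    # A second stand-in, to show that swapping needs no caller changes.
    def complete(self, prompt: str) -> str:
        return prompt.upper()


def summarise_ticket(model: ChatModel, ticket_text: str) -> str:
    """Business logic sees ChatModel, never a vendor client."""
    return model.complete(f"Summarise: {ticket_text}")
```

Swapping providers then means writing one new adapter class rather than touching every call site. The same stand-ins double as test fakes, which helps with validating tools and prompts in isolation.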
What the Market Looks Like Right Now
A few patterns are emerging as production deployments mature:
- LangGraph + LangSmith is becoming the default for Python-native engineering teams building complex multi-agent workflows
- OpenAI Assistants API dominates for rapid prototyping and lightweight single-agent use cases
- AWS Bedrock / Azure AI Foundry are gaining ground in enterprise contexts where governance and existing cloud contracts matter
- Open-source models + self-hosted orchestration is growing fast in regulated industries (finance, healthcare, legal)
The meta-trend: the boundary between "agent framework" and "deployment platform" is blurring. Vendors who started as frameworks are building cloud infrastructure; cloud providers are building higher-level agent abstractions. In 18 months, the stacks that exist today will look materially different.
The Bottom Line
The agent stack decision is architectural — it compounds. The tools you choose today shape what's easy and what's painful two years from now.
For most B2B teams starting out:
- LLM: OpenAI GPT-4o or Anthropic Claude Sonnet as primary; a fast/cheap model for high-volume tasks
- Framework: LangGraph if you need complex multi-agent workflows; CrewAI for faster prototyping
- Orchestration: Managed cloud service if you're on AWS/Azure; roll your own only if you have platform engineering capacity
- Memory: Redis for session state + pgvector or Pinecone for semantic retrieval
- Observability: Langfuse (open-source, cost-effective) or LangSmith if you're on LangChain
The most important thing isn't picking the perfect stack — it's picking a coherent one and instrumenting it well so you can learn and iterate.
Agent infrastructure is a means, not an end. The end is business value: faster processes, better customer experiences, more leverage per head. Choose tools that get you there, not tools that look impressive in a pitch deck.