If you scroll through any production LLM system long enough, you start seeing the same seven operations re-implemented in slightly different ways across the code base:
- Categorize this thing into one of N buckets.
- Rate this thing on a 1–10 axis with a reason.
- Write a reply, draft, or post in this tone.
- Compress a long thing into a short thing.
- Pull typed fields out of unstructured text.
- Break a goal into ordered steps.
- Look at evidence and return a recommendation with caveats.
Those are the seven I keep seeing. In my own code base, each piece of recurring LLM work collapsed into one of these shapes. Most teams discover this the third or fourth time they write a function like classifyEmail, scoreLead, triagePR, or categorizeTicket and notice they're all the same function with different prompts.
There are real operations that don't fit cleanly: translation, redaction, reranking, free-form Q&A. Those matter, but they aren't the core pattern I kept seeing. The seven below are the load-bearing shape; the rest are specializations the same constructor pattern handles when they ship (see section 4).
This article is about the pattern that comes after that realization: capability factories. They are to LLM prompts what class Repository<T> is to database queries, or what reusable services are to repeated business logic: a typed, reusable, observable shape that captures what you want done without tying you to which model does it.
I extracted this pattern out of a production system. The TypeScript implementation is now an open-source library called llm-ports. The article below is more about the shape than the library. I'll show the library at the end so you can see how the abstraction looks when it ships.
1. Why the same seven keep showing up
The first system I built with LLMs was an executive assistant called BEPA. It triaged email, drafted replies, scored business leads, summarized meeting notes, extracted action items, planned multi-step research projects, and analyzed weekly metrics.
After about 18 months I sat down with a grep window open and counted what was actually happening in the code:
generateText() calls reading {role, content} : 47,000
.parse() calls on Zod schemas of generateObject : 34,000
"Reply in the user's tone" prompt fragments : 12,000
"Rate this 1-10 with reason" prompt fragments : 9,000
"Extract X from Y, return JSON" : 18,000
"Plan this in steps, return ordered list" : 4,000
"Analyze X and recommend" : 6,000

Every one of those collapsed to one of seven shapes. The variation was which Zod schema, which rubric, which tone, which prompt-engineering tricks happened to work for that model that week.
The interesting part is that the call shape (inputs, outputs, validation, observability) was identical inside each shape. Only the content varied. Which is exactly the precondition for extracting a reusable abstraction.
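To make "same call shape, different content" concrete, here is a minimal sketch. `Model`, `makeClassifier`, and the stub model are hypothetical names for illustration only, and the stub stands in for a real LLM call; none of this is the library's API.

```typescript
// Hypothetical stand-in for an LLM call: takes a prompt, returns raw text.
type Model = (prompt: string) => Promise<string>;

// One generic classifier; only the label set and the rubric vary per use.
function makeClassifier<L extends string>(
  model: Model,
  labels: readonly L[],
  rubric: string,
) {
  return async (content: string): Promise<L> => {
    const prompt = `${rubric}\nAnswer with one of: ${labels.join(", ")}\n\n${content}`;
    const raw = (await model(prompt)).trim();
    // Reject anything outside the allowed label set.
    const match = labels.find((l) => l === raw);
    if (match === undefined) throw new Error(`Unexpected label: ${raw}`);
    return match;
  };
}

// Stub model for the sketch: answers "urgent" when the prompt mentions an outage.
const stubModel: Model = async (prompt) =>
  prompt.includes("outage") ? "urgent" : "normal";

// Two "different" functions that are the same shape with different rubrics.
const classifyEmail = makeClassifier(stubModel, ["urgent", "normal"] as const, "Rate email urgency.");
const categorizeTicket = makeClassifier(stubModel, ["urgent", "normal"] as const, "Rate ticket urgency.");
```

The two exported functions differ only in the rubric string, which is the observation the rest of the article builds on.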
💡 "You're not writing 47 prompts. You're writing 7 prompts, 47 times, with slightly different ingredients."

2. The seven, defined
These are the seven cognitive operations I extracted. Two ground rules: (a) each one has a deterministic call shape regardless of model, and (b) each one has a fixed return shape: its Zod schema is a property of the capability, not the call. If your operation doesn't fit one of these cleanly, you probably have two operations mashed together (in which case split them) or you have a specialization the seven don't cover (in which case see section 4).
| Capability | What You Give It | What You Get Back | When It Pays Off |
|---|---|---|---|
| Classify | Content + rubric | One label from an enum + reasoning | Triage queues, routing |
| Score | Content + rubric + axes | Numeric ratings per axis | Lead scoring, quality grading |
| Draft | Persona + situation + reference | Longer text in a chosen tone | Replies, posts, briefings |
| Summarize | Long content + length target | Shorter content, key points preserved | Thread digests, briefings |
| Extract | Unstructured text + schema | A typed structured object | Invoice fields, contact info, action items |
| Plan | Goal + constraints + tools | An ordered list of steps | Research tasks, multi-step workflows |
| Analyze | Evidence + question | Recommendation with caveats and confidence | Judgment calls, options memos |
Notice three structural patterns across the table:
- Every capability has a Zod return type that doesn't depend on the prompt. A classifier always returns { label, reasoning }. A scorer always returns { axes: Record<string, { score, reason }> }. The schema is a property of the capability, not the call.
- Every capability has a rubric or persona that comes from data, not from code. A classifyEmail and a classifyPullRequest are the same capability, parameterized by different rubrics.
- Every capability is "one shot, one schema, one decision." They are not agents. They do not loop. They do not call tools. Capabilities are the unit of cognition; agents compose capabilities.
The boundary line that matters: a capability ends when a typed value comes back. Anything that loops, branches, or calls tools is an agent built on top of capabilities. Don't blur the two.
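One way to picture that boundary is in types. The sketch below is an assumption about how the fixed return shapes and the capability/agent split could be written down; the type names and `triageThenScore` are hypothetical, not exports of any package.

```typescript
// Illustrative types only; the names are assumptions, not library exports.
type ClassifyResult<L extends string> = { label: L; reasoning: string };
type ScoreResult = { axes: Record<string, { score: number; reason: string }> };

// A capability: one input in, one typed value back. No loops, no tools.
type Capability<In, Out> = (input: In) => Promise<Out>;

// An agent sits on top: it may branch (or loop) over capability results.
async function triageThenScore(
  classify: Capability<string, ClassifyResult<"lead" | "spam">>,
  score: Capability<string, ScoreResult>,
  content: string,
): Promise<ScoreResult | null> {
  const { label } = await classify(content);
  // Branching lives here, in the composing code, not inside a capability.
  return label === "lead" ? score(content) : null;
}
```

Each capability stays a single typed call; the conditional that decides whether to score at all belongs to the composing function.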

3. Why scattered prompts are technical debt, not flexibility
A reasonable objection: "Why not just write each prompt where it's used? Less abstraction is better than more."
I made that argument to myself for the first year of building BEPA. The cost came due in three places:
3.1. Drift
Six developers wrote six "score this lead 1-10" prompts over a year. Half used 1-10, half used 1-5. Three included a reasoning field; one returned reasoning as part of the same string. Two prompts caused the model to occasionally return "score": "high" instead of "score": 9, which then crashed the consumer.
Eventually someone refactored them all to a shared shape, but only after a P1 from a customer-facing surface returning a string where a number was expected.
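The "score": "high" failure is the kind of thing a single shared validator catches before the consumer ever sees it. Below is a hand-rolled sketch of that check; in the setup this article describes, a Zod schema plays this role, and `LeadScore` and `parseLeadScore` are hypothetical names.

```typescript
// Hand-rolled validator for illustration; a Zod schema would normally do this job.
type LeadScore = { score: number; reasoning: string };

function parseLeadScore(raw: string): LeadScore {
  const data: unknown = JSON.parse(raw);
  if (typeof data !== "object" || data === null) {
    throw new Error("Expected a JSON object");
  }
  const { score, reasoning } = data as Record<string, unknown>;
  // One agreed scale (1-10) and one agreed type (integer), enforced in one place.
  if (typeof score !== "number" || !Number.isInteger(score) || score < 1 || score > 10) {
    throw new Error(`score must be an integer from 1 to 10, got ${JSON.stringify(score)}`);
  }
  if (typeof reasoning !== "string") {
    throw new Error("reasoning must be a string");
  }
  return { score, reasoning };
}
```

With one parser shared by every lead-scoring call, a string where a number was expected becomes a thrown error at the boundary rather than a crash downstream.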
3.2. Model migration
When we rotated the underlying model from one Claude minor version to another (or briefly tried a Cerebras-hosted reasoning model for speed), every prompt that had been tuned to the old model's quirks broke a little differently. Because the prompts were inlined at each call site, the fix was to edit 47 places.
The capability factory version of that same migration was: edit one rubric file, rerun the test suite that scored 50 known-good examples against the factory, ship.
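A regression suite of that kind can be very small. The sketch below shows one possible shape for it, with a stub classifier in place of a real factory-built one; `Example`, `accuracy`, and the sample data are hypothetical.

```typescript
// Hypothetical golden-set harness; names and sample data are illustrative.
type Example<L extends string> = { content: string; expected: L };

// Runs every known-good example through the classifier and returns the hit rate.
async function accuracy<L extends string>(
  classify: (content: string) => Promise<L>,
  examples: readonly Example<L>[],
): Promise<number> {
  if (examples.length === 0) throw new Error("need at least one example");
  let correct = 0;
  for (const ex of examples) {
    if ((await classify(ex.content)) === ex.expected) correct += 1;
  }
  return correct / examples.length;
}

// Stub classifier standing in for a real factory-built one.
const stubClassify = async (content: string): Promise<"P0" | "P3"> =>
  content.includes("down") ? "P0" : "P3";

const golden: Example<"P0" | "P3">[] = [
  { content: "Checkout is down for all users", expected: "P0" },
  { content: "FYI: new logo looks nice", expected: "P3" },
];
```

After a model or rubric change, rerunning `accuracy` against the same golden set gives a single number to compare before shipping.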
3.3. Observability
When the CTO asks "what's our LLM spend on triage, broken out by category, last month?" the answer requires either (a) instrumenting every call site individually, or (b) having one place where every classify call passes through.
Capability factories give you (b) for free. Inline prompts force you toward (a) and you eventually do half of (a) and lose the rest.
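Once every call passes through one place, answering the CTO's question is an aggregation over the recorded events. A minimal sketch, assuming a hypothetical `CallEvent` shape (the field names are illustrative, not a library type):

```typescript
// Hypothetical event shape recorded at the single choke point.
type CallEvent = { capability: string; category: string; costUSD: number };

// Sums cost per capability or per category.
function spendBy(
  events: readonly CallEvent[],
  key: "capability" | "category",
): Map<string, number> {
  const totals = new Map<string, number>();
  for (const e of events) {
    totals.set(e[key], (totals.get(e[key]) ?? 0) + e.costUSD);
  }
  return totals;
}

// Sample data for the sketch.
const events: CallEvent[] = [
  { capability: "classify", category: "bug", costUSD: 0.002 },
  { capability: "classify", category: "bug", costUSD: 0.003 },
  { capability: "classify", category: "question", costUSD: 0.001 },
];
```

Calling `spendBy(events, "category")` groups triage spend by category; the same events grouped by `"capability"` give the per-task view.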
4. Capability factories as system assets
The goal isn't to make the call prettier. The goal is to move schema, routing, cost, and observability out of scattered feature code and into one reusable definition.
A capability factory is just a constructor that takes the invariant parts of a capability (schema, rubric/persona, model routing, hooks) and returns a function that takes the varying parts (the actual content) and returns a typed result.
In TypeScript, that constructor looks like this:
import { createClassifier } from "@llm-ports/capabilities";
import { z } from "zod";

const PriorityClassification = z.object({
  priority: z.enum(["P0", "P1", "P2", "P3"]),
  category: z.enum(["bug", "feature", "question", "other"]),
  reasoning: z.string().min(20),
});

export const classifyIncomingRequest = createClassifier({
  port: llm, // your LLM port (provider-agnostic)
  schema: PriorityClassification,
  schemaName: "incoming-request-triage",
  rubric: `
    P0: prod-broken or customer-blocking; reply within 1 hour
    P1: significant business impact; same-day
    P2: standard professional ask; within 2 days
    P3: nice-to-have or FYI; no SLA
  `,
  onResult: async (event) => {
    await metrics.track({
      capability: event.capability,
      cost: event.cost.totalUSD,
      latencyMs: event.latencyMs,
      validationAttempts: event.validationAttempts,
    });
  },
});

And then at every call site (across 47 files), the call is the same shape:
const triage = await classifyIncomingRequest({ content: ticket.subject + "\n\n" + ticket.body });
// ^? { priority: "P0"|"P1"|"P2"|"P3"; category: "bug"|...; reasoning: string }

What the factory does for you, that inline prompts don't:
- Bad model output stops at the capability boundary. The Zod schema is the source of truth. If the model returns invalid JSON or a wrong enum value, the library retries with a correction prompt automatically (a strategy called retry-with-feedback). The application code never sees the bad attempt.
- Cost is attributed per task, not discovered after the invoice. Every call reports cost.totalUSD computed against the configured pricing table, plus validationAttempts so you can see when models are getting it wrong on the first try. (The pricing table itself is point-in-time: provider rate changes, cache-token discounts, and tier shifts can drift it relative to actual invoices. Reconcile against your provider bill if you need to-the-cent accuracy.)
- Observability is wired once, not copied into every call site. onResult fires for every successful call, onError for failures, onBeforeCall for redaction or argument logging. You wire them once at factory construction; every call site benefits.
- Provider choice moves out of business logic. Notice port: llm instead of a hardcoded openai("gpt-4o"). The factory routes through whatever your registry has wired. You can swap providers without touching the factory or its 47 call sites.
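The retry-with-feedback strategy named in the first bullet can be sketched without any library. This is a minimal illustration under stated assumptions, not the library's implementation: `Model`, `Parser`, `callWithFeedback`, and the wording of the correction prompt are all hypothetical.

```typescript
// Minimal retry-with-feedback loop. All names are hypothetical, not the library's API.
type Model = (prompt: string) => Promise<string>;
type Parser<T> = (raw: string) => T; // must throw on invalid output

async function callWithFeedback<T>(
  model: Model,
  prompt: string,
  parse: Parser<T>,
  maxAttempts = 2,
): Promise<{ value: T; validationAttempts: number }> {
  let currentPrompt = prompt;
  let lastError = "";
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const raw = await model(currentPrompt);
    try {
      // A valid answer ends the loop; the caller never sees earlier bad attempts.
      return { value: parse(raw), validationAttempts: attempt };
    } catch (err) {
      lastError = err instanceof Error ? err.message : String(err);
      // Feed the validation error back so the model can correct itself.
      currentPrompt =
        `${prompt}\n\nYour previous answer was invalid: ${lastError}\n` +
        `Previous answer: ${raw}\nReturn corrected JSON only.`;
    }
  }
  throw new Error(`Validation failed after ${maxAttempts} attempts: ${lastError}`);
}
```

Returning `validationAttempts` alongside the value is what lets a metrics hook see how often the first attempt fails.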
The seven shapes give you seven such factories. In @llm-ports/capabilities they are createClassifier, createScorer, createDrafter, createSummarizer, createExtractor, createPlanner, createAnalyzer. Each has the same shape: schema, rubric/persona, hooks, model routing.
These seven are the load-bearing pattern, not the entire universe. The roadmap adds specialized factories on the same shape over time: tag (multi-label classify), detect (boolean classify), expand, rewrite, redact, respond, decide, answer, rerank, and more. New shapes adopt the same constructor pattern, so a call site that uses createClassifier today won't need a different mental model when createReranker ships.
💡 "The capability is the noun. The prompt is an implementation detail."
5. How this fits with llm-ports
A capability needs to call some LLM. Where does the model come from?
@llm-ports/core provides an LLMPort interface that hides the provider behind a typed contract. The capability factory imports it; you can wire any adapter that satisfies the port:
import { createRegistryFromEnv } from "@llm-ports/core";
import { createAnthropicAdapter } from "@llm-ports/adapter-anthropic";
import { createOpenAIAdapter } from "@llm-ports/adapter-openai";

const registry = createRegistryFromEnv({
  adapters: {
    anthropic: createAnthropicAdapter({ apiKey: process.env.ANTHROPIC_API_KEY! }),
    openai: createOpenAIAdapter({ apiKey: process.env.OPENAI_API_KEY! }),
  },
});

export const llm = registry.getPort();

And your .env says which providers handle which task types and in what fallback order:
LLM_PROVIDER_FAST=anthropic|claude-haiku-4-5|cost:5/day
LLM_PROVIDER_SMART=anthropic|claude-sonnet-4-6|cost:50/day
LLM_PROVIDER_BACKUP=openai|gpt-4o|cost:10/day
LLM_TASK_ROUTE_TRIAGE=fast,backup
LLM_TASK_ROUTE_DRAFT=smart,backup
LLM_TASK_ROUTE_GENERAL=fast,smart,backup

When classifyIncomingRequest runs, it picks the first provider that's within its USD budget and routes through it. If Claude Haiku's daily budget is gone, it walks to GPT-4o. If a 5xx fires, in v0.2 it will walk to the next provider automatically. The application code (the 47 call sites) sees none of this.
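The budget walk described above can be sketched in a few lines. The parsing rules here are inferred from the .env examples and are an assumption, not the library's implementation; `Provider`, `parseProvider`, and `pickProvider` are hypothetical names.

```typescript
// Sketch of budget-aware route walking. Parsing rules are inferred from the
// .env examples and are an assumption, not the library's implementation.
type Provider = { name: string; model: string; dailyBudgetUSD: number };

// Parses a spec such as "anthropic|claude-haiku-4-5|cost:5/day".
function parseProvider(spec: string): Provider {
  const [name, model, cost] = spec.split("|");
  const match = /^cost:(\d+(?:\.\d+)?)\/day$/.exec(cost ?? "");
  if (!name || !model || !match) throw new Error(`Bad provider spec: ${spec}`);
  return { name, model, dailyBudgetUSD: Number(match[1]) };
}

// Walks the route in order and returns the first provider still under budget.
function pickProvider(
  route: readonly string[],
  providers: Record<string, Provider>,
  spentTodayUSD: Record<string, number>,
): Provider {
  for (const alias of route) {
    const p = providers[alias];
    if (p && (spentTodayUSD[alias] ?? 0) < p.dailyBudgetUSD) return p;
  }
  throw new Error("All providers on this route are over budget");
}
```

With the triage route `["fast", "backup"]`, a `fast` provider that has already spent its 5 USD is skipped and the call falls through to `backup`.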
You get four properties that scattered prompts can't give you:
- Switching providers is a config change. Edit .env. No code changes anywhere.
- Cost is bounded. Budget exhaustion is a typed exception (BudgetExceededError), not a surprise invoice.
- Schema drift is contained. Schemas are constraints, not suggestions. Bad model output is auto-retried with feedback; if the retry also fails, you get a typed ValidationError instead of garbage data leaking downstream. Drift in rubrics, schema versions, and prompt phrasing still happens; it just happens in one place per capability, not scattered across 47 call sites.
- Capabilities compose. Build an agent or workflow by calling capabilities in sequence; each call is observable and budgeted independently.
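Typed errors matter because the call site can branch on them. The sketch below uses local stand-in classes that reuse the error names mentioned above; the real library classes may carry different fields, and `Outcome` and `runSafely` are hypothetical.

```typescript
// Local stand-ins that reuse the error names from the list above. The real
// library classes may expose different fields; these are assumptions.
class BudgetExceededError extends Error {}
class ValidationError extends Error {}

type Outcome<T> =
  | { kind: "ok"; value: T }
  | { kind: "deferred"; reason: string }
  | { kind: "needs-review"; reason: string };

// Wraps a capability call and maps typed failures to explicit outcomes.
async function runSafely<T>(call: () => Promise<T>): Promise<Outcome<T>> {
  try {
    return { kind: "ok", value: await call() };
  } catch (err) {
    if (err instanceof BudgetExceededError) return { kind: "deferred", reason: err.message };
    if (err instanceof ValidationError) return { kind: "needs-review", reason: err.message };
    throw err; // unknown failures still surface
  }
}
```

A workflow built this way can queue deferred work for tomorrow's budget and route validation failures to a human, instead of treating every failure as the same generic exception.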
6. What's in the box
@llm-ports ships as a set of small packages so you only install what you need:
# core: the LLMPort interface, registry, cost gating
pnpm add @llm-ports/core
# pick at least one adapter
pnpm add @llm-ports/adapter-anthropic @anthropic-ai/sdk
pnpm add @llm-ports/adapter-openai openai
pnpm add @llm-ports/adapter-ollama ollama # local LLMs
pnpm add @llm-ports/adapter-vercel ai # Vercel AI SDK migration
# the seven capabilities
pnpm add @llm-ports/capabilities
# all need zod 3.24+ as a peer dep
pnpm add zod

Total install footprint (core + one adapter + capabilities) is under 1.5 MB unpacked. Zero LangChain dependencies. Strict TypeScript.
The repo ships examples for the adoption paths that matter most (basic classification, multi-provider routing, local models, Vercel AI migration, and live integration tests), alongside a docs site with concept guides, adapter reference pages, and per-capability deep dives.
The package is pre-1.0 (currently in alpha at v0.1.0-alpha.2). The architecture is stable and the offline regression suite is comprehensive, but the public surface may still see minor adjustments before v0.1 stable; check the v0.1 status page for the per-surface inventory of what's stable today vs. still being hardened.
If the capability-factory pattern resonates with how you're building, I'd love feedback in GitHub Discussions. What shapes are you re-implementing that aren't on the list? What knobs do the seven need that they don't have today?
Why this matters beyond code reuse
Capability factories are not only about cleaner prompts.
They create a shared boundary for how AI work enters the system.
Once classification, scoring, drafting, summarization, extraction, planning, and analysis all pass through typed capability factories, the organization gets a place to attach policy:
- which model can handle which task
- how much each task is allowed to cost
- what schema defines a valid result
- what gets logged for quality review
- what gets retried, rejected, or escalated
That does not make capability factories a full governance system by themselves. They do not replace access control, audit infrastructure, redaction policy, or human approval workflows.
But they create the control surface those systems need.
You cannot govern hundreds of scattered prompts. You can govern a small set of typed capabilities.
💡 "Stop writing 47 prompts. Start writing 7 capabilities, each with a rubric you can version."
TL;DR
- Most production LLM systems eventually reimplement the same seven cognitive operations: classify, score, draft, summarize, extract, plan, analyze.
- Inlining the prompts at each call site causes drift, model-migration pain, and observability holes.
- A capability factory lifts the invariant parts (schema, rubric, hooks, model routing) into a constructor; the call sites only vary the content.
- The shape is provider-agnostic. Pair it with an LLMPort abstraction (any adapter that satisfies the contract) and switching providers becomes an .env change.
- An MIT-licensed TypeScript implementation lives at @llm-ports/capabilities. The 7 foundational factories are 100% Zod-typed, with hooks for analytics and observability; specialized factories follow the same shape as they ship.
