Agentic AI FinOps: stop runaway inference spend without killing quality

A practical runbook for tagging agent workflows, capping loops, and reviewing burn when token prices fall but bills keep climbing.

By My Eco Token Team

  • finops
  • agents
  • cost
  • ops

Your token unit costs dropped this year. Your AI program budget did not. That gap is the signal that your problem is no longer model pricing—it is architecture.

Agentic workflows can chain ten to twenty model calls per user-visible task. Always-on monitors add baseline spend when nobody is in the product. Finance sees a hockey stick; engineering sees the same dashboards that worked for chat pilots. This runbook is for teams that need control this quarter, not a strategy deck next year.

Why agentic spend breaks traditional AI budgets

Chat pilots stay quiet on cost because usage scales with the number of human users. Agents scale with events, retries, tools, and background jobs. A single support automation can fan out into classification, retrieval, drafting, validation, and logging, and each step is billable.

Three patterns show up in almost every postmortem:

  • Loop depth without a hard stop: the agent keeps calling tools until something succeeds.
  • Always-on monitors: health checks and classifiers run 24/7 even when traffic is flat.
  • Shared keys with no tags: finance gets one invoice, engineering cannot attribute burn to a workflow.

Traditional "turn down the temperature" advice does not touch these drivers. You need workflow-level FinOps.

Step 1: Tag every production request before it hits a model

If you cannot attribute spend, you cannot govern it. Minimum metadata schema:

  • team_id and cost_center
  • product or feature_id
  • workflow_name (for example support_triage, code_review_bot)
  • environment (production vs staging)
  • intent (exploration vs production)

Operational rule: block untagged calls in production. Allow missing tags only in personal sandboxes with low caps.
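The operational rule can be sketched as a small check that runs before a request is forwarded to a model. This is a minimal sketch, not a gateway integration: the tag names follow the schema above (with feature_id standing in for "product or feature_id"), while the "sandbox" environment value, the exception name, and both function names are assumptions made for illustration.

```python
from typing import Mapping

# Tag names follow the schema above; "feature_id" stands in for
# "product or feature_id". The "sandbox" environment value is an assumption.
REQUIRED_TAGS = (
    "team_id",
    "cost_center",
    "feature_id",
    "workflow_name",
    "environment",
    "intent",
)


class UntaggedRequestError(Exception):
    """Raised when a non-sandbox request is missing required tags."""


def missing_tags(tags: Mapping[str, str]) -> list[str]:
    """Return the required tag names that are absent or blank."""
    return [name for name in REQUIRED_TAGS if not (tags.get(name) or "").strip()]


def enforce_tags(tags: Mapping[str, str]) -> None:
    """Block untagged calls unless the request comes from a sandbox."""
    if tags.get("environment") == "sandbox":
        return  # missing tags tolerated here; low caps are enforced elsewhere
    missing = missing_tags(tags)
    if missing:
        raise UntaggedRequestError("missing required tags: " + ", ".join(missing))
```

A request with no environment tag at all is treated as production and blocked, so the check fails closed rather than open.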

Implementation checklist:

  • Add headers or fields your gateway already forwards—do not rely on post-hoc log parsing alone.
  • Validate tags in CI for services that call models.
  • Publish the schema in one internal doc; refuse one-off exceptions without an expiry date.

Step 2: Split exploration and production envelopes

Exploration needs freedom; production needs predictability. Create two budget types:

  • Per-user exploration credits with weekly reset and premium models allowed.
  • Per-workflow production envelopes tied to an owner and a monthly cap.

When a production envelope hits 80% burn, alert the owner, not only finance. At 100%, automatically downgrade the routing policy or require approval to continue on premium models.
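The 80% and 100% thresholds can be expressed as a small decision function. This is a sketch under stated assumptions: the Envelope fields, the action labels, and the idea that spend is tracked in US dollars are illustrative choices, not part of any particular billing system.

```python
from dataclasses import dataclass


@dataclass
class Envelope:
    """A per-workflow production budget (field names are illustrative)."""
    workflow_name: str
    owner: str
    monthly_cap_usd: float
    spent_usd: float = 0.0

    def __post_init__(self) -> None:
        if self.monthly_cap_usd <= 0:
            raise ValueError("monthly_cap_usd must be positive")

    @property
    def burn(self) -> float:
        """Fraction of the monthly cap already spent."""
        return self.spent_usd / self.monthly_cap_usd


def envelope_action(env: Envelope, alert_at: float = 0.8) -> str:
    """Map current burn to the action described in the text."""
    if env.burn >= 1.0:
        return "downgrade_or_require_approval"
    if env.burn >= alert_at:
        return "alert_owner"
    return "ok"
```

For example, an envelope with a 1,000 USD cap and 800 USD spent returns "alert_owner", and the same envelope at 1,000 USD spent returns "downgrade_or_require_approval".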

Not ideal when: your entire product is one undifferentiated chat surface. In that case, tag by conversation topic first, then split envelopes once traffic patterns are visible.

Step 3: Cap agent loop depth and log every step

Agents need guardrails that are technical, not cultural.

  • Set max_steps per run (start with 8–12 for tool-heavy flows; tune from data).
  • Set max_daily_spend per agent identity.
  • Log each step: model, route reason, input tokens, output tokens, latency, status.
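A hard step cap with per-step logging might look like the sketch below. The step_fn callable is hypothetical: it stands in for whatever performs one model or tool call in your agent, and the dict keys it returns are assumptions chosen to match the log fields listed above. The per-agent daily spend limit is not shown here.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class StepLog:
    step: int
    model: str
    route_reason: str
    input_tokens: int
    output_tokens: int
    latency_s: float
    status: str


def run_agent(step_fn: Callable[[int], dict], max_steps: int = 10) -> tuple[list[StepLog], bool]:
    """Run step_fn until it reports done or max_steps is reached.

    step_fn (hypothetical) performs one model or tool call and returns a dict
    with the keys: model, route_reason, input_tokens, output_tokens, done.
    Returns the per-step logs and whether the run finished on its own.
    """
    logs: list[StepLog] = []
    for step in range(max_steps):
        started = time.monotonic()
        result = step_fn(step)
        logs.append(StepLog(
            step=step,
            model=result["model"],
            route_reason=result["route_reason"],
            input_tokens=result["input_tokens"],
            output_tokens=result["output_tokens"],
            latency_s=time.monotonic() - started,
            status="done" if result["done"] else "continue",
        ))
        if result["done"]:
            return logs, True
    # Hard stop: the cap was reached before the agent reported success.
    if logs:
        logs[-1].status = "stopped_at_max_steps"
    return logs, False
```

Returning the logs even when the cap is hit keeps the evidence needed for review: a run that ends with "stopped_at_max_steps" is a candidate for the loop-depth questions in the weekly review.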

Review the top ten agent identities by cost every Friday. For any agent whose spend is above 2x its seven-day median, open a twenty-minute review: Is the loop necessary? Can retrieval be cached? Can a smaller model handle the middle steps?
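The review trigger can be computed from daily cost totals. The sketch below makes one interpretation explicit, and it is an assumption: the most recent day's cost is compared against the median of the seven days before it. The function name and input shape are illustrative.

```python
from statistics import median


def agents_to_review(
    daily_cost: dict[str, list[float]],
    factor: float = 2.0,
    top_n: int = 10,
) -> list[str]:
    """Flag agents whose latest daily cost exceeds factor x their prior seven-day median.

    daily_cost maps an agent identity to its daily costs, oldest first, with
    the most recent day last. Only the top_n agents by latest cost are checked.
    """
    with_data = {agent: costs for agent, costs in daily_cost.items() if costs}
    ranked = sorted(with_data, key=lambda a: with_data[a][-1], reverse=True)[:top_n]
    flagged = []
    for agent in ranked:
        history = with_data[agent][-8:-1]  # the seven days before the latest
        latest = with_data[agent][-1]
        if len(history) == 7 and latest > factor * median(history):
            flagged.append(agent)
    return flagged
```

Agents with fewer than seven days of history are skipped rather than flagged, since a median over one or two days says little about a normal baseline.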

Step 4: Measure outcomes, not tokens alone

Token charts answer "what happened?" Outcome metrics answer "was it worth it?"

Pick one outcome metric per workflow before changing models:

  • Support: cost per resolved ticket
  • Sales enablement: cost per accepted draft
  • Engineering: cost per merged change with human edit distance below a threshold

Route down only when outcome quality is flat for two review cycles. If human rework rises, roll back the policy and document why.
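The route-down rule can be made mechanical once an outcome metric and a quality score exist for each review cycle. This is a sketch under assumptions: the quality score is a number where higher is better, "flat" is taken to mean no drop beyond a small tolerance below the baseline, and the 0.02 tolerance is an illustrative value, not a recommendation.

```python
from typing import Optional


def cost_per_outcome(total_cost_usd: float, outcomes: int) -> Optional[float]:
    """Cost per resolved ticket, accepted draft, or merged change."""
    if outcomes <= 0:
        return None  # no outcomes yet, so the ratio is undefined
    return total_cost_usd / outcomes


def quality_is_flat(
    baseline: float,
    recent_cycles: list[float],
    tolerance: float = 0.02,
    cycles_required: int = 2,
) -> bool:
    """True when the last cycles_required scores stay within tolerance of baseline.

    recent_cycles holds one quality score per review cycle, oldest first.
    """
    if len(recent_cycles) < cycles_required:
        return False
    return all(score >= baseline - tolerance for score in recent_cycles[-cycles_required:])
```

For instance, 120 USD spent on 40 resolved tickets gives 3.0 USD per ticket, and a baseline quality of 0.90 followed by cycles scoring 0.91 and 0.85 does not count as flat, so the cheaper route would not be adopted.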

Quick facts

  • Agentic inference is often the largest line item once chat pilots go to production.
  • Attribution tags must be required in production, not optional in logs.
  • Loop caps and per-agent daily limits prevent the fastest budget fires.
  • Outcome metrics protect quality while you optimize cost.
  • Weekly workflow reviews beat monthly invoice surprises.

FAQ

Do we need a new FinOps hire? Not necessarily. Start with a weekly thirty-minute forum: platform, one product owner, and finance. Review tagged burn and outcome metrics.

Should we ban premium models? No. Concentrate premium models on high-value paths; route drafts, classification, and extraction to economical models with explicit policy.

What if tags are wrong? Treat tag quality like schema migrations: fix forward, backfill where possible, and block deploys that drop required fields.

Can we fix this in one sprint? Tagging and envelopes are often a week of work. Policy tuning is ongoing—plan for two-week review cadences, not a one-time project.

Your first 30 minutes

Export yesterday's inference log (or enable logging if missing). Filter production traffic. Count what percentage has all required tags—if it is under 90%, stop other work and fix tagging first.
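The coverage count can be a few lines once the log is loaded. The record shape here is an assumption: each record is a dict with an env field set by infrastructure (so production traffic can be identified even when tags are missing) and a tags dict. The tag names follow the schema from Step 1, with feature_id standing in for "product or feature_id".

```python
from typing import Optional

REQUIRED_TAGS = (
    "team_id",
    "cost_center",
    "feature_id",
    "workflow_name",
    "environment",
    "intent",
)


def tag_coverage(records: list[dict]) -> Optional[float]:
    """Percentage of production records that carry every required tag.

    Each record is assumed to look like {"env": "production", "tags": {...}}.
    Returns None when there is no production traffic in the log.
    """
    production = [r for r in records if r.get("env") == "production"]
    if not production:
        return None
    fully_tagged = sum(
        1
        for r in production
        if all((r.get("tags", {}).get(name) or "").strip() for name in REQUIRED_TAGS)
    )
    return 100.0 * fully_tagged / len(production)
```

A result below 90.0 is the signal described above to pause other work and fix tagging first.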

Pick your highest-cost workflow. Assign an owner. Set an 80% alert on a dedicated envelope. Add max_steps to the agent if applicable. Schedule a Friday review. That five-action block is enough to show whether agentic FinOps is working before you redesign the whole stack.

If you use a unified credits dashboard, mirror these envelopes there so finance and engineering see the same numbers in the same meeting.