Part 3 | AI in Production | Chapter 10

API Cost Optimization: How I Cut Spend by 60–80% on Real Workflows

Most teams discover their API bills the way I discovered mine: through an unexpectedly large invoice. Then they spend an afternoon panicking, find one obvious cut, and stop.

The obvious cut is usually swapping GPT-4o for GPT-4o-mini on a single node and declaring victory. That might shave twenty percent off. It is not enough.

The reductions I aim for — sixty, seventy, eighty percent — come from refusing to make the call in the first place.

They come from caching, from batching, from conditional execution, and from what I call the cheap-first chain. These are not optimizations you bolt on later; they are design decisions you make before the workflow ever hits production. I have rebuilt enough workflows after bill shock to recognize the same four or five leaks every time. This is how I fix them.

The Cheap-First Chain

Framework · The cheap-first chain

Start with the cheapest model or API that could possibly do the job. Escalate to the expensive one only when the cheap one fails, returns a parse error, or hits a confidence threshold you define.

Most tutorials on LLM routing get this backwards. They show you the powerful model first, then mention in a footnote that a smaller one exists. In production, that means every single request burns the full rate. I do the opposite. I default to the small model for classification, extraction, routing, and yes/no decisions. Only the minority of requests that actually need long-form generation or complex reasoning see the large model.

On a support pipeline handling roughly a thousand tickets a day, classifying every ticket with GPT-4o at about three cents per call costs about thirty dollars. Routing them through GPT-4o-mini for classification (about ten cents for the full thousand), then escalating only the two hundred tickets that need a custom reply to GPT-4o (six dollars), drops the daily cost to six dollars and ten cents. That is roughly an eighty percent reduction before you have done anything clever with caching or batching.

In n8n, I implement this with two AI nodes and an IF node. The first node uses a small model with a tightly constrained prompt and a max token limit of ten or twenty.

{
  "resource": "chat",
  "model": "gpt-4o-mini",
  "messages": [
    {
      "role": "system",
      "content": "Classify this ticket into billing, technical, feature_request, or spam. Reply with exactly one word."
    },
    {
      "role": "user",
      "content": "={{ $json.emailBody }}"
    }
  ],
  "temperature": 0,
  "maxTokens": 10
}

If the classification indicates the ticket needs human-level reasoning — or if the small model returns a format I cannot parse — the workflow branches to the second node using the large model. If not, it routes to a template response and stops. The large model is the exception, not the rule.
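The branch decision itself can be sketched in a few lines. This is a minimal illustration, not the exact logic of any particular workflow: the function name `routeTicket` and the choice of which labels escalate (`NEEDS_LARGE_MODEL`) are assumptions made for the example. In an n8n Code node the input would come from the classifier node's output rather than a function argument.

```javascript
// Labels the small model is allowed to return (from the system prompt above).
const ALLOWED_LABELS = ['billing', 'technical', 'feature_request', 'spam'];

// Hypothetical routing rule: labels that need a custom reply from the large model.
const NEEDS_LARGE_MODEL = new Set(['billing', 'technical']);

function routeTicket(rawModelOutput) {
  // Normalize: trim, lowercase, and drop stray punctuation such as "Billing."
  const label = String(rawModelOutput ?? '')
    .trim()
    .toLowerCase()
    .replace(/[^a-z_]/g, '');

  // Output that is not one of the allowed labels counts as a cheap-model failure.
  if (!ALLOWED_LABELS.includes(label)) {
    return { label: null, escalate: true, reason: 'unparseable' };
  }

  const escalate = NEEDS_LARGE_MODEL.has(label);
  return { label, escalate, reason: escalate ? 'needs_custom_reply' : 'template_ok' };
}
```

An IF node can then branch on `escalate`. Treating unparseable output as a reason to escalate keeps the failure path explicit instead of letting a malformed label fall through to a template.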

There is a legitimate edge case: some workflows genuinely need the large model first. Complex contract analysis, nuanced sentiment detection, and multi-step reasoning usually require the heavy model from the start. But in my experience, those workflows are less than twenty percent of what teams actually build. The rest are overpaying for classification tasks dressed up as generation tasks.

Batching: When It Works and When It Wastes Your Time

Batching is the second place I look. Many APIs charge per request, not per item, and their rate limits force you into slow serial loops if you call them one by one. Batching fixes both problems. But only sometimes.

Default to batching any non-real-time API call that accepts bulk payloads. A geocoding job for five hundred addresses finishes in under two minutes batched instead of eight minutes serially, and it avoids the rate-limit retries that serial calling triggers. When the API has a native bulk endpoint, I use a Code node to chunk items into batches of fifty, then fire each batch with a controlled interval.

const items = $input.all();
const batchSize = 50;
const batches = [];

for (let i = 0; i < items.length; i += batchSize) {
  const chunk = items.slice(i, i + batchSize);
  batches.push({
    json: {
      addresses: chunk.map(item => item.json.address),
      batchIndex: Math.floor(i / batchSize)
    }
  });
}

return batches;

Where batching fails is real-time pipelines and APIs without bulk support. If a webhook needs a synchronous response within five seconds, batching introduces latency you cannot afford. If the API has no batch endpoint, you are left with n8n's built-in item batching, which helps with rate limits but does not reduce the total request count. In those cases, I do not batch; I throttle with exponential backoff and controlled concurrency instead.
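For the throttling fallback, here is a minimal sketch of exponential backoff plus a concurrency cap in plain JavaScript. The helper names (`backoffDelayMs`, `withBackoff`, `runWithConcurrency`) and the default delays are assumptions for illustration, and `callApi` stands in for whatever request function you actually use. Jitter is left out to keep the example short.

```javascript
// Delay before retry number `attempt` (0-based): base * 2^attempt, capped.
function backoffDelayMs(attempt, baseMs = 500, capMs = 30000) {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Retry an async call with exponential backoff. `callApi` is a hypothetical
// function that returns a promise and throws on a rate-limit or server error.
async function withBackoff(callApi, { maxRetries = 4, baseMs = 500 } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await callApi();
    } catch (err) {
      if (attempt >= maxRetries) throw err; // out of retries: surface the error
      await sleep(backoffDelayMs(attempt, baseMs));
    }
  }
}

// Run an array of task functions with at most `limit` in flight at once.
async function runWithConcurrency(tasks, limit = 3) {
  const results = new Array(tasks.length);
  let next = 0;
  async function worker() {
    while (next < tasks.length) {
      const i = next++; // claim the next task index
      results[i] = await tasks[i]();
    }
  }
  const workers = Array.from({ length: Math.min(limit, tasks.length) }, () => worker());
  await Promise.all(workers);
  return results;
}
```

Wrapping each task as `() => withBackoff(() => callApi(item))` combines the two: a fixed number of requests in flight, each retrying with growing delays.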

The hidden trap in batching

When you send fifty items and three fail, the API often returns a mixed success-and-error response. You need downstream logic to split the batch, retry the failures individually, and log them. Teams that skip that step discover weeks later that three percent of their data never made it through.

I handle this by inspecting the batch response in a Code node and routing failed items to a retry loop while letting successes pass through. If retrying is not an option, I log the failure and alert rather than silently dropping the record.
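A sketch of that split, assuming a hypothetical response shape: `response.results` holds one entry per submitted item, in submission order, each with a `status` of `'ok'` or `'error'`. Real bulk APIs report partial failures in different ways, so the field names here are placeholders to adapt.

```javascript
// Split a mixed batch response into successes and failures.
// The `results`, `status`, `data`, and `error` fields are assumed names.
function splitBatchResponse(submittedItems, response) {
  const succeeded = [];
  const failed = [];

  submittedItems.forEach((item, i) => {
    const result = response.results?.[i];
    if (result && result.status === 'ok') {
      succeeded.push({ ...item, result: result.data });
    } else {
      // A missing entry is treated as a failure so nothing is dropped silently.
      failed.push({ ...item, error: result?.error ?? 'no result returned' });
    }
  });

  return { succeeded, failed };
}
```

Because every submitted item lands in exactly one of the two arrays, `succeeded.length + failed.length` always equals the batch size, which is a cheap invariant to log.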

Conditional Execution: Skip the Call Entirely

Before I call an expensive API, I ask whether the input has actually changed since the last run. If it has not, I skip the call entirely.

This sounds obvious, but most webhook-driven workflows ignore it. A CRM fires an event on every field change, including irrelevant metadata updates like last_viewed_at. If the workflow regenerates an AI product description for every event, it makes five thousand LLM calls a day when only fifty of those events involve a meaningful change to the product name, category, features, or price.

I fix this with an input hash. In a Code node, I take only the fields that affect the downstream result, stringify them, and generate an MD5 hash. I compare that hash against the previous run's hash, stored in a database or Google Sheet. If they match, the workflow returns the cached result. If they differ, it proceeds to the expensive API and writes the new hash.

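// Note: on self-hosted n8n, requiring built-in modules in a Code node may need
// NODE_FUNCTION_ALLOW_BUILTIN to include "crypto".
const crypto = require('crypto');
const product = $input.first().json;

// Only the fields that change the generated description go into the hash.
const relevant = {
  name: product.name,
  category: product.category,
  features: product.features,
  price: product.price
};

// MD5 is fine here: the hash detects changes, it is not a security control.
const hash = crypto
  .createHash('md5')
  .update(JSON.stringify(relevant))
  .digest('hex');

// Assumes an upstream node merged the stored hash into the item as _previous_hash.
const previous = $input.first().json._previous_hash;

return [{
  json: {
    ...product,
    currentHash: hash,
    needsRegeneration: hash !== previous
  }
}];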

Downstream, an IF node checks needsRegeneration. The false branch returns the previously generated description without touching the LLM. On a catalog with a thousand products and five thousand daily CRM events, this check reduces five thousand OpenAI calls to fifty. At three cents per call, that is 4,950 avoided calls, or roughly one hundred forty-eight dollars a day saved by a single comparison.

Don't hash volatile fields

If you include a timestamp, an auto-generated ID, or a last_modified field that changes on every touch, your hash will never match and you will get zero benefit. Hash only the semantic inputs that actually alter the output.
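One way to enforce this is to strip volatile fields and sort keys before hashing, so that neither a touched timestamp nor a reordered payload produces a false mismatch. The sketch below is an assumption-laden variant of the hash above: `VOLATILE_FIELDS`, `stableStringify`, and `semanticHash` are illustrative names, and the field list should be adjusted to your own payload.

```javascript
const crypto = require('crypto');

// Fields that change on every touch and never affect the generated output.
// Illustrative list; adjust to your own payload.
const VOLATILE_FIELDS = new Set(['last_viewed_at', 'last_modified', 'updated_at']);

// Serialize with sorted keys so property order cannot change the hash.
function stableStringify(value) {
  if (Array.isArray(value)) {
    return '[' + value.map(stableStringify).join(',') + ']';
  }
  if (value !== null && typeof value === 'object') {
    const keys = Object.keys(value)
      .filter((k) => !VOLATILE_FIELDS.has(k))
      .sort();
    const pairs = keys.map((k) => JSON.stringify(k) + ':' + stableStringify(value[k]));
    return '{' + pairs.join(',') + '}';
  }
  // JSON.stringify(undefined) returns undefined, so fall back to 'null'.
  return JSON.stringify(value) ?? 'null';
}

function semanticHash(record) {
  return crypto.createHash('md5').update(stableStringify(record)).digest('hex');
}
```

Array order is preserved on purpose: if the order of `features` does not matter for your output, sort that array before hashing.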

Caching and the Cache-Pin Pattern

Conditional execution skips work when the data has not changed. Caching skips work when the data changes slowly. I use both.

The classic example is an exchange rate lookup. If a workflow converts prices from USD to EUR every time a product is viewed, it might call the rate API five hundred times a day. But exchange rates change daily, not per page view. I cache the response with a twenty-four-hour TTL in a Google Sheet or Postgres table, then check the cache before every call.

const cacheEntry = $input.first().json;
const ttlHours = 24;

if (cacheEntry?.rate && cacheEntry?.cached_at) {
  const age = (Date.now() - new Date(cacheEntry.cached_at).getTime()) / (1000 * 60 * 60);
  if (age < ttlHours) {
    return [{
      json: {
        rate: cacheEntry.rate,
        source: 'cache',
        expires_in_hours: Math.round(ttlHours - age)
      }
    }];
  }
}

return [{ json: { source: 'miss', needs_refresh: true } }];

A cache hit costs zero API calls. On a five-hundred-call-per-day workflow, this drops the API usage to one call per day. That is a 99.8 percent reduction on that single endpoint.

TTL discipline matters. A TTL that is too short wastes calls; a TTL that is too long serves stale data. I set the TTL to match the real-world volatility of the data: twenty-four hours for exchange rates, five minutes for stock prices, a week for company metadata. When in doubt, I start long and tighten based on observed data quality issues.
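To keep those TTL choices in one place instead of scattered across nodes, a small lookup table works. This is a sketch under assumed names: the `kind` keys and the `isFresh` helper are illustrative, and the durations are the ones quoted above.

```javascript
const HOUR_MS = 60 * 60 * 1000;

// TTLs from the examples above, in milliseconds.
const TTL_MS = {
  exchange_rate: 24 * HOUR_MS,
  stock_price: 5 * 60 * 1000,
  company_metadata: 7 * 24 * HOUR_MS,
};

// Returns true when a cached entry of the given kind can still be served.
// `cachedAt` is an ISO timestamp string; `now` is injectable for testing.
function isFresh(kind, cachedAt, now = Date.now()) {
  const ttl = TTL_MS[kind];
  const cachedMs = new Date(cachedAt).getTime();
  // Unknown kind or unparseable date: treat as a miss rather than guess.
  if (ttl === undefined || Number.isNaN(cachedMs)) return false;
  return now - cachedMs < ttl;
}
```

Tightening a TTL after a data-quality complaint then becomes a one-line change.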

Framework · The cache-pin pattern

After a successful test run, pin the output data on the node so downstream edits don't trigger re-execution. Combined with swapping the Webhook trigger for a Manual Trigger during development, this eliminates the silent cost of building.

Idempotency is part of the same discipline. Webhook senders retry on timeout, and without deduplication keys, you process the same event twice and pay for both executions. I log every incoming webhook's idempotency key to a table before processing, and I return 200 OK immediately if the key already exists. The sender stops retrying, and I stop double-paying.
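The deduplication check reduces to a lookup before any paid work happens. In this sketch an in-memory `Set` stands in for the idempotency table, and `handleDelivery` and `processEvent` are hypothetical names. A real workflow would use a database table with a unique constraint on the key so the check survives restarts and concurrent executions.

```javascript
// In-memory stand-in for the idempotency table.
const seenKeys = new Set();

// Decide what to do with one webhook delivery.
// `idempotencyKey` is whatever unique ID the sender attaches (name assumed).
function handleDelivery(idempotencyKey, processEvent) {
  if (seenKeys.has(idempotencyKey)) {
    // Duplicate delivery: acknowledge it and do no paid work.
    return { status: 200, processed: false };
  }
  seenKeys.add(idempotencyKey); // log the key before processing
  processEvent();
  return { status: 200, processed: true };
}
```

Both paths return 200 so the sender stops retrying; only the first one costs anything.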

The Looks-Cheap Trap

There is a class of API that looks harmless in isolation and becomes a hemorrhage in aggregate. The worst offenders are the ones priced at fractions of a penny.

I see this most often with reference data lookups: exchange rates, geocoding, enrichment APIs, and configuration fetches. A workflow processes a hundred orders, and because nobody enabled Execute Once on the config node, it fetches the same exchange rate a hundred times.

Key takeaway

The Execute Once checkbox is right there in the node settings. One call instead of a hundred. At a thousand executions a day, that single checkbox saves a hundred dollars a month.
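The checkbox is a node setting, not code, but the same idea applies when the lookup lives inside a Code node: fetch the shared value once and attach it to every item. The sketch below illustrates that pattern only. `attachSharedRate` and `fetchRate` are hypothetical names, and `fetchRate` stands in for the paid lookup.

```javascript
// Fetch shared reference data once and attach it to every item.
// `fetchRate` is a hypothetical function standing in for the paid lookup.
function attachSharedRate(orders, fetchRate) {
  const rate = fetchRate(); // one call, regardless of how many orders there are
  return orders.map((order) => ({ ...order, rate }));
}
```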

Token waste falls into the same category. LLM pricing is per token, and sending a thirty-page transcript to GPT-4o when only the first five pages and last two pages matter is like mailing a ream of paper when a postcard would do. I pre-process long inputs in a Code node, keeping the head and tail and summarizing the middle with a cheap model before sending the package to the expensive one.

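const transcript = $input.first().json.transcript;
const maxChars = 12000; // at or below this size, send the transcript unchanged

if (transcript.length <= maxChars) {
  return [{ json: { optimizedText: transcript, strategy: 'none' } }];
}

// Keep the opening and closing verbatim; only the middle gets summarized.
const head = transcript.substring(0, 4000);
const tail = transcript.substring(transcript.length - 3000);
const middle = transcript.substring(4000, transcript.length - 3000);

return [{
  json: {
    head,
    middle,
    tail,
    strategy: 'head-tail-with-middle-summary',
    // Upper bound at roughly 4 characters per token: the summary that replaces
    // the middle still costs tokens, so real savings are somewhat lower.
    estimatedTokensSaved: Math.floor(middle.length / 4)
  }
}];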

Then I send the middle section through GPT-4o-mini for compression, and feed the concatenated result to GPT-4o. On a call transcript pipeline, this cuts the per-transcript cost by roughly seventy-eight percent.
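The reassembly step can be sketched as follows, assuming the cheap model's summary arrives as a plain string. `assembleCompressed` and `estimateTokens` are illustrative names, the bracketed marker text is an arbitrary choice, and the token counts use the same rough four-characters-per-token heuristic as the splitter, so treat them as estimates.

```javascript
// Rough token estimate: about 4 characters per token.
const estimateTokens = (text) => Math.ceil(text.length / 4);

// Rebuild the input for the expensive model from the verbatim head and tail
// plus the cheap model's summary of the middle (`middleSummary`, assumed string).
function assembleCompressed({ head, middle, tail }, middleSummary) {
  const optimizedText = [
    head,
    '[Summary of omitted middle section]',
    middleSummary,
    tail,
  ].join('\n\n');

  const originalTokens = estimateTokens(head + middle + tail);
  const optimizedTokens = estimateTokens(optimizedText);

  return {
    optimizedText,
    originalTokens,
    optimizedTokens,
    tokensSaved: originalTokens - optimizedTokens,
  };
}
```

Logging `tokensSaved` per transcript gives a measured number to compare against the estimate from the splitter.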

The Real-World Math

So what happens when you stack these patterns on a single workflow? The savings do not just add up; they multiply, because each layer removes calls that the previous layer would have processed.

Consider a pipeline that handles support tickets, enriches customer data, and generates draft replies. Before optimization:

| Cost Center | Before | Driver |
|---|---|---|
| Classification / routing | $30.00/day | 1,000 calls × $0.03 on GPT-4o |
| Data enrichment | $10.00/day | 1,000 calls × $0.01 |
| Exchange rate lookup | $1.00/day | 1,000 calls × $0.001 |
| Daily total | $41.00 | |

After applying the cheap-first chain, conditional execution, caching, and Execute Once:

| Cost Center | After | Driver |
|---|---|---|
| Classification | $0.10/day | 1,000 calls × $0.0001 on GPT-4o-mini |
| Escalated replies | $6.00/day | 200 calls × $0.03 on GPT-4o |
| Data enrichment | $0.50/day | 50 calls × $0.01 after input-hash dedup |
| Exchange rate | $0.001/day | 1 call × $0.001 with 24h TTL |
| Daily total | $6.60 | 84% reduction |

That is an eighty-four percent reduction. Over a month, the difference between $1,230 and $198 pays for actual engineering time.
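The arithmetic is easy to re-run with your own volumes. This sketch uses only the call counts and per-call prices quoted above; the variable names are illustrative.

```javascript
// Call volumes and per-call prices from the before/after breakdown above.
const before = [
  { name: 'classification', calls: 1000, pricePerCall: 0.03 },
  { name: 'enrichment', calls: 1000, pricePerCall: 0.01 },
  { name: 'exchange_rate', calls: 1000, pricePerCall: 0.001 },
];

const after = [
  { name: 'classification', calls: 1000, pricePerCall: 0.0001 },
  { name: 'escalated_replies', calls: 200, pricePerCall: 0.03 },
  { name: 'enrichment', calls: 50, pricePerCall: 0.01 },
  { name: 'exchange_rate', calls: 1, pricePerCall: 0.001 },
];

// Sum calls × price across all cost centers.
const dailyTotal = (rows) => rows.reduce((sum, r) => sum + r.calls * r.pricePerCall, 0);

const beforeTotal = dailyTotal(before);
const afterTotal = dailyTotal(after);
const reduction = 1 - afterTotal / beforeTotal;

// prints "41.00 6.60 83.9%"
console.log(beforeTotal.toFixed(2), afterTotal.toFixed(2), (reduction * 100).toFixed(1) + '%');
```

Swapping in your own rows shows which layer carries the savings: here the escalated replies are still about ninety percent of the remaining spend.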

Measure or Die

None of this works if you cannot see where the money is going. I add a lightweight logging node after every paid API call that records the timestamp, model, endpoint, input tokens, output tokens, and estimated cost.

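// Token usage as reported by the API. The field layout depends on the node
// that produced this item, so adjust the paths if yours differs.
const usage = $input.first().json.usage || {};

// USD per 1,000 tokens. Check these against the current price list.
const pricing = {
  'gpt-4o': { input: 0.0025, output: 0.01 },
  'gpt-4o-mini': { input: 0.00015, output: 0.0006 }
};
const model = $input.first().json.model || 'unknown';

// The API may report a dated snapshot name such as "gpt-4o-mini-2024-07-18",
// so match on the longest known prefix instead of requiring an exact key.
const key = Object.keys(pricing)
  .sort((a, b) => b.length - a.length)
  .find(name => model.startsWith(name));
const p = pricing[key] || { input: 0, output: 0 };

const cost = ((usage.prompt_tokens || 0) / 1000) * p.input +
             ((usage.completion_tokens || 0) / 1000) * p.output;

return [{
  json: {
    timestamp: new Date().toISOString(),
    workflow: $workflow.name,
    node: $prevNode.name,
    model,
    input_tokens: usage.prompt_tokens,
    output_tokens: usage.completion_tokens,
    // Six decimals, so cheap classification calls do not round down to zero.
    cost_usd: Math.round(cost * 1000000) / 1000000
  }
}];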

I write this to a Postgres table and run a weekly query grouping by workflow and model.

SELECT
  workflow,
  model,
  COUNT(*) AS call_count,
  SUM(cost_usd) AS total_cost
FROM api_usage_log
WHERE timestamp > NOW() - INTERVAL '7 days'
GROUP BY workflow, model
ORDER BY total_cost DESC;

Without this log, you are optimizing in the dark. With it, the worst offender is obvious and the payoff from fixing it is quantified before you start.

What to Do Monday Morning

API cost optimization is not a one-time audit. It is a maintenance habit.

Cross-reference the invoice against the execution log

Pull last month's invoice. Name the three most expensive nodes.

Demote classification and routing to a cheap model

Replace any classification, routing, or extraction step with the cheapest model that can handle it. Escalate only on failure.

Add input-hash dedup to scheduled LLM calls

Before every LLM call that runs on a schedule or webhook, add an input-hash check. If the data has not changed, skip it.

Cache reference lookups with TTLs that match real volatility

Exchange rates, geocoding, company info — cache them with a TTL that matches how often the data actually changes. Twenty-four hours for FX; a week for company metadata.

Use Manual Triggers and pinned data during development

Swap active Webhook triggers for Manual Triggers, and pin test data on any node upstream of a paid API.

Turn on Execute Once everywhere it makes sense

Any node that fetches shared configuration, tokens, or exchange rates gets the Execute Once checkbox.

Reserve 30 minutes every Friday for the cost log

Open the weekly cost query. Attack the top item.

The teams that keep their API bills sane are not the ones with the best vendor discounts. They are the ones that treat every call as a failure mode until proven otherwise.