Field Guide · Updated May 2026

How to Reduce AI Token Usage for Claude Code, OpenAI, Cursor, and Gemini

A practical playbook for developers who want lower AI spend without damaging the quality of the work: measure tokens, shrink context, route simple tasks to cheaper models, and stop agent loops before they compound.

~20 min read · Claude Code · OpenAI · Cursor · Gemini · Calculator + operator checklist

Measure before you optimize

Token waste is hard to fix by intuition. A prompt that feels short can drag a large hidden context along with it. A coding agent that looks busy can spend several turns repeating the same failed strategy. A cheaper model can be the right answer for summaries and classification, but the wrong answer for a subtle architecture decision.

Start by measuring. Anthropic documents token counting and Claude Code cost commands. OpenAI exposes cached token usage and recommends reducing output tokens and filtering retrieved context. Gemini also provides token-counting guidance for API workflows. Once you can see the inputs, outputs, and cache behavior, token reduction becomes engineering instead of folklore.
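If you want a number before the call rather than after it, count locally. A minimal sketch, assuming the tiktoken package; Anthropic and Gemini expose their own count-tokens endpoints, and this covers OpenAI-family encodings:

```python
# A minimal sketch of local token counting, assuming the tiktoken package.
import tiktoken

def count_tokens(text: str, model: str = "gpt-4o") -> int:
    try:
        enc = tiktoken.encoding_for_model(model)
    except KeyError:
        enc = tiktoken.get_encoding("o200k_base")  # fallback for unknown models
    return len(enc.encode(text))

prompt = open("prompt.txt").read()  # placeholder path
print(f"{count_tokens(prompt)} input tokens before the call")
```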

Operating rule: do not optimize every prompt. Optimize the repeated prompts, long-context sessions, automated agent loops, and provider workflows that run often enough to matter.

The six high-impact levers

1. Shrink context bloat

Context grows quietly during coding work. Files, diffs, terminal output, previous attempts, and standing instructions can all become part of the next turn. Before sending a prompt, ask whether the model needs the whole file, the whole log, or only the failing function and the exact error.
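For Python code, the standard library can do that trimming for you. A sketch of sending one function instead of the whole file; the file and function names are placeholders:

```python
# Extract a single function so the prompt skips the rest of the module.
import ast

def extract_function(source: str, name: str) -> str:
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)) and node.name == name:
            return ast.get_source_segment(source, node)
    raise ValueError(f"function {name!r} not found")

# "billing.py" and "apply_discount" are placeholder names.
snippet = extract_function(open("billing.py").read(), "apply_discount")
```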

2. Compact or reset stale sessions

In Claude Code, use /compact when you need to preserve the important state but remove accumulated noise. Use /clear when the next task does not need the old context at all. Mixed-purpose sessions are convenient, but they are often expensive.
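Anthropic's documentation describes /compact as accepting optional instructions about what to keep. An illustrative compaction, where the wording is an assumption about your session:

```
/compact Keep the auth-refactor plan and the failing test output; drop the build-script exploration
```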

3. Route simple work to cheaper models

Summaries, changelog drafts, classification, fixture generation, and mechanical rewrites rarely need the strongest model. Keep premium models for ambiguous debugging, architecture tradeoffs, and tasks where a wrong answer costs more than the tokens.
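Routing does not need to be clever to pay off. A rule-based sketch; the model IDs are assumptions, so substitute whatever cheap and premium tiers you actually use:

```python
# A rule-based routing sketch with placeholder model IDs.
CHEAP_TASKS = {"summary", "changelog", "classification", "fixtures", "mechanical-rewrite"}

def pick_model(task_type: str) -> str:
    return "claude-3-5-haiku-latest" if task_type in CHEAP_TASKS else "claude-sonnet-4-5"

print(pick_model("changelog"))  # cheap tier
print(pick_model("debugging"))  # premium tier
```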

4. Cut output length

Output tokens are often the expensive side of the call. Ask for the answer format you need: patch only, bullets only, command output only, or a short diagnosis followed by the fix. Do not pay for narration when the task needs an answer.
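In API code, the cap belongs in the request, not just the prompt. A sketch with the OpenAI Python SDK; the model ID is an assumption, and newer reasoning models may expect max_completion_tokens instead:

```python
from openai import OpenAI

# Placeholder context: the trimmed failing function plus the exact error.
context = "def paginate(items, page, size): ...\nIndexError: list index out of range"

client = OpenAI()  # assumes OPENAI_API_KEY in the environment
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    max_tokens=400,  # hard cap on paid output tokens
    messages=[
        {"role": "system", "content": "Reply with a unified diff only. No prose."},
        {"role": "user", "content": context},
    ],
)
print(resp.usage.completion_tokens, "output tokens paid for")
```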

5. Preprocess noisy inputs

Logs, PDFs, repository dumps, and generated files should be filtered before the model sees them. Keep the error, the surrounding lines, the relevant function, and the command that produced the output. Remove repeated stack traces, minified files, lockfile noise, and unrelated logs.
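A small filter in front of the model often removes most of the waste. A minimal sketch that keeps error lines plus a few lines of surrounding context:

```python
# Keep error lines plus surrounding context; drop everything else.
import re

ERROR_PATTERN = re.compile(r"error|exception|traceback|failed", re.IGNORECASE)

def trim_log(log: str, context: int = 5, max_lines: int = 120) -> str:
    lines = log.splitlines()
    keep: set[int] = set()
    for i, line in enumerate(lines):
        if ERROR_PATTERN.search(line):
            keep.update(range(max(0, i - context), min(len(lines), i + context + 1)))
    return "\n".join(lines[i] for i in sorted(keep)[:max_lines])
```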

6. Stop runaway loops

Agentic workflows can spend heavily while stuck. If an agent has worked on the same error class for several turns without changing the state, pause the run. Read the diff, inspect the actual error, and redirect the task with tighter instructions.
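One way to make the pause automatic is to fingerprint the failure between turns. A sketch; exact hashing is deliberately strict, so normalize timestamps or paths first if your errors vary cosmetically:

```python
# Stop an agent run when the same error signature repeats across turns.
import hashlib

class LoopGuard:
    def __init__(self, max_repeats: int = 3):
        self.max_repeats = max_repeats
        self.last_signature: str | None = None
        self.repeats = 0

    def check(self, error_output: str) -> None:
        sig = hashlib.sha256(error_output.encode()).hexdigest()
        self.repeats = self.repeats + 1 if sig == self.last_signature else 1
        self.last_signature = sig
        if self.repeats >= self.max_repeats:
            raise RuntimeError(f"Same failure {self.repeats} turns in a row; pause and redirect.")
```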

Claude Code workflow habits

Claude Code is powerful because it can work across a project, but that same strength makes context discipline important. Keep stable project rules in CLAUDE.md. Use short task boundaries. Compact between unrelated features. Clear history before switching from implementation to a different investigation.
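A CLAUDE.md can stay short. An illustrative example; the commands and rules are placeholders for your project's own:

```markdown
# CLAUDE.md (illustrative contents; write your project's own)

## Commands
- Test: `pytest -q`
- Lint: `ruff check .`

## Rules
- Propose a diff before editing more than two files.
- Never touch generated files under `dist/`.
```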

Anthropic's Claude Code cost documentation describes commands such as /cost, context clearing, and compaction. Treat those as part of the workflow, not cleanup after something has already gone wrong.

For rate-limit pressure specifically, read the longer explainer: why Claude Code rate limits happen. If you want the runway visible while you work, the product page is here: Claude Code rate-limit tracker.

API and app workflows

In API products, token optimization usually comes from repeatability: stable prompt prefixes, smaller retrieved context, output limits, and model routing. OpenAI and Anthropic both document prompt caching behavior. The practical pattern is simple: keep shared instructions stable and put dynamic content late in the prompt so repeated prefixes are easier to reuse.
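Anthropic's prompt caching makes the stable prefix explicit. A sketch against the documented cache_control shape; the model ID and rule text are placeholders:

```python
import anthropic

STABLE_PROJECT_RULES = "...long, unchanging project instructions..."  # placeholder

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY in the environment
resp = client.messages.create(
    model="claude-sonnet-4-5",  # assumption: substitute your model ID
    max_tokens=500,
    system=[{
        "type": "text",
        "text": STABLE_PROJECT_RULES,
        "cache_control": {"type": "ephemeral"},  # mark the stable prefix cacheable
    }],
    messages=[{"role": "user", "content": "Today's question goes last."}],
)
print(resp.usage)  # cache_read_input_tokens shows whether the cache is doing work
```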

For retrieval systems, do not send every document chunk because it might be useful. Filter aggressively, deduplicate, and cap the amount of context per answer. For user-facing apps, measure cache hit rates and output length before changing model quality.
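A sketch of that filtering step; it assumes each retrieved chunk arrives from your retrieval layer as a dict with text, score, and token-count fields:

```python
# Filter, dedupe, and cap retrieved context before it reaches the prompt.
def select_chunks(chunks: list[dict], max_total_tokens: int = 3000,
                  min_score: float = 0.35) -> list[dict]:
    seen, picked, budget = set(), [], max_total_tokens
    for chunk in sorted(chunks, key=lambda c: c["score"], reverse=True):
        if chunk["score"] < min_score:
            break  # everything below this is noise, not context
        key = chunk["text"].strip().lower()
        if key in seen or chunk["tokens"] > budget:
            continue  # skip duplicates and chunks that blow the budget
        seen.add(key)
        picked.append(chunk)
        budget -= chunk["tokens"]
    return picked
```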

If your costs span several providers, the broader entry points are AI spend tracker for Mac, AI budget tracker for developers, and OpenAI cost tracker.

Provider-specific playbooks

Claude Code: preserve state, remove noise

Claude Code is strongest when it has the right project context and weakest when a session carries too much stale history. Keep stable instructions in the project guide, ask for bounded diffs, use compaction when the session is still relevant, and clear context when switching to unrelated work. If rate pressure is rising, finish the current patch before asking for another broad investigation.

Cursor: avoid sending the whole workspace by habit

Cursor workflows can become expensive when every question drags in large files, generated output, or old terminal logs. Highlight only the relevant function, failing test, interface, or diff. Ask for a plan before edits when the task is ambiguous, then narrow the implementation request to the files that actually need to change.

OpenAI API: control output and repeated prefixes

For API work, reduce output tokens first because verbose responses compound quickly. Keep stable instructions stable, put dynamic content later, and cap response formats with schemas or explicit structures when possible. Use prompt caching where the provider supports it, but still measure cache hit rates instead of assuming the cache is doing useful work.
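A sketch combining an output cap with OpenAI's documented json_schema response format; the model ID and schema are illustrative:

```python
from openai import OpenAI

trimmed_log = "...the filtered failure output from lever 5..."  # placeholder

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o-mini",  # assumption: substitute your model
    max_tokens=300,
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "diagnosis",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "root_cause": {"type": "string"},
                    "next_command": {"type": "string"},
                },
                "required": ["root_cause", "next_command"],
                "additionalProperties": False,
            },
        },
    },
    messages=[{"role": "user", "content": "Diagnose this failure:\n" + trimmed_log}],
)
```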

Gemini and research workflows: summarize before expanding

Research prompts often invite long context and long answers. Start with a narrow question, ask for source extraction or a short decision table, then expand only the section that matters. If a task needs broad reading, separate collection from synthesis so you do not repeatedly pay for the same context.
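A two-stage sketch of that separation, assuming a hypothetical ask(model, prompt) helper you would wrap around your provider SDK (Gemini or otherwise); the model names are placeholders:

```python
def ask(model: str, prompt: str) -> str:
    raise NotImplementedError("wrap your provider SDK here")  # hypothetical helper

def research(question: str, documents: list[str]) -> str:
    # Stage 1: a cheap model extracts only the relevant facts per document.
    notes = [ask("cheap-model", f"Extract facts relevant to: {question}\n\n{doc}")
             for doc in documents]
    # Stage 2: a stronger model synthesizes from the short notes, never from
    # the full documents, so the expensive context is paid for once.
    return ask("strong-model", f"Answer: {question}\n\nNotes:\n" + "\n".join(notes))
```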

Prompt patterns that reduce token waste

Replace broad asks with scoped asks

Weak prompt: “Review this whole repo and improve it.” Better prompt: “Inspect the checkout flow files only, identify the highest-risk conversion issue, and propose a patch without touching unrelated modules.” The second prompt lowers context, reduces wandering, and gives the model a sharper success condition.

Ask for the output shape upfront

If you need a patch, ask for a patch. If you need a diagnosis, ask for three bullets and the next command. If you need a comparison, define the columns. Output shape is one of the simplest ways to reduce cost because every unnecessary explanation becomes paid output.

Use checkpoints for agentic work

For long coding tasks, ask the agent to stop after investigation, after the first patch, and after verification. Checkpoints prevent a model from turning uncertainty into expensive motion. They also make it easier to notice when a task needs a human decision instead of another automated loop.
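A checkpoint wrapper can enforce this outside the prompt. A sketch, assuming a hypothetical run_agent_phase helper around your agent framework:

```python
def run_agent_phase(phase: str, task: str) -> str:
    raise NotImplementedError("wrap your agent framework here")  # hypothetical

def run_with_checkpoints(task: str) -> None:
    # Pause after investigation, after the first patch, and after verification.
    for phase in ("investigate", "first patch", "verify"):
        print(f"--- {phase} ---\n{run_agent_phase(phase, task)}")
        if input("Continue? [y/N] ").strip().lower() != "y":
            print("Stopped at checkpoint; redirect or hand off.")
            return
```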

Keep reusable context outside the chat when possible

Project rules, test commands, style decisions, and deployment notes belong in project files that the agent can reference deliberately. Repeating the same background in every prompt wastes tokens and increases the chance of stale instructions. A good repository guide often saves more tokens than a clever one-off prompt.

Tools and repos that actually help

Tool lists are only useful when they map to a workflow. The tools worth knowing are the ones that help you count tokens, route models, test prompts, or package context more deliberately.

This guide avoids advice that depends on unverifiable hacks such as gaming a usage window or avoiding alleged peak hours. The useful long-term tactics are the ones that survive changes in provider policy.

Estimate your monthly AI spend

A worked example using published API pricing: 40 messages per day with about 800 words of context each works out to roughly 1,064 input tokens per message, about $0.42/day, or $12.47/month at current usage. With the optimisations in this guide applied, the same workload models at $6.93/month: a saving of $5.54/month, a 44% reduction.

Optimisations modelled: 40% context reduction via /compact + CLAUDE.md, 50% of tasks routed to Haiku/mini. Actual savings vary. Pricing based on published API rates as of April 2026.

Token Waste Audit

The 10-point checklist

Check off what you already do. Everything unchecked is potential savings.


The monitoring layer

Manual optimization reduces waste. Monitoring catches drift. Those are different jobs. You can compact a session, route simple tasks to a cheaper model, and still miss the fact that your combined Claude Code, OpenAI, Cursor, Copilot, and Gemini spend is climbing week by week.

Tokens 4 Breakfast keeps that signal in the Mac menu bar: provider-level usage, subscriptions, session budgets, and monthly pressure. It is free for one provider and takes about two minutes to try.

Try it free

Tokens 4 Breakfast

macOS menu bar app. Tracks 8 AI providers. Free plan: 1 provider, always. Pro: all 8 providers, session budgets, spend projection, and exports for $7.99 one-time.

macOS 14.6 Sonoma or later · No account required · Local-first · 14-day money-back guarantee on Pro