Measure before you optimize
Token waste is hard to fix by intuition. A prompt that feels short can drag a large hidden context along with it. A coding agent that looks busy can spend several turns repeating the same failed strategy. A cheaper model can be the right answer for summaries and classification, but the wrong answer for a subtle architecture decision.
Start by measuring. Anthropic documents token counting and Claude Code cost commands. OpenAI exposes cached token usage and recommends reducing output tokens and filtering retrieved context. Gemini also provides token-counting guidance for API workflows. Once you can see the inputs, outputs, and cache behavior, token reduction becomes engineering instead of folklore.
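Before wiring up provider-specific counters, a rough estimate already makes costs visible. The sketch below uses the common "about 4 characters per token" heuristic for English text; it is an approximation only, and real numbers should come from the provider's tokenizer or count-tokens endpoint. The function names and prices are illustrative, not from any provider's documentation.

```python
# Rough token estimator: ~4 characters per token is a common rule of
# thumb for English text. Use the provider's tokenizer for real counts.
def estimate_tokens(text: str) -> int:
    return len(text) // 4

def call_cost(input_text: str, output_text: str,
              in_price_per_mtok: float, out_price_per_mtok: float) -> float:
    """Approximate cost of one call from text sizes and
    per-million-token prices (prices are placeholders)."""
    cost = (estimate_tokens(input_text) * in_price_per_mtok
            + estimate_tokens(output_text) * out_price_per_mtok) / 1_000_000
    return round(cost, 6)
```

Even this crude estimate is enough to rank which prompts and sessions deserve optimization first.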
Operating rule: do not optimize every prompt. Optimize the repeated prompts, long-context sessions, automated agent loops, and provider workflows that run often enough to matter.
The six high-impact levers
1. Shrink context bloat
Context grows quietly during coding work. Files, diffs, terminal output, previous attempts, and standing instructions can all become part of the next turn. Before sending a prompt, ask whether the model needs the whole file, the whole log, or only the failing function and the exact error.
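One concrete way to send only the failing function instead of the whole file is to slice it out by name. This is a Python-only sketch using the standard `ast` module; for other languages, a grep around the symbol plus a few lines of context serves the same purpose.

```python
import ast

def extract_function(source: str, name: str) -> str:
    """Return only the named function's source instead of the whole file."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)) and node.name == name:
            # get_source_segment slices the original text using the node's
            # recorded line/column positions.
            return ast.get_source_segment(source, node)
    raise KeyError(f"no function named {name!r}")
```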
2. Compact or reset stale sessions
In Claude Code, use /compact when you need to preserve the important state but remove accumulated noise. Use /clear when the next task does not need the old context at all. Mixed-purpose sessions are convenient, but they are often expensive.
3. Route simple work to cheaper models
Summaries, changelog drafts, classification, fixture generation, and mechanical rewrites rarely need the strongest model. Keep premium models for ambiguous debugging, architecture tradeoffs, and tasks where a wrong answer costs more than the tokens.
4. Cut output length
Output tokens are often the expensive side of the call. Ask for the answer format you need: patch only, bullets only, command output only, or a short diagnosis followed by the fix. Do not pay for narration when the task needs an answer.
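Asking for a shape can be baked into the request builder, so every call carries both an explicit format instruction and a hard output cap. The parameter names below mirror common chat-API shapes but are assumptions; adapt them to your provider's SDK.

```python
def build_request(prompt: str, shape: str, max_tokens: int = 400) -> dict:
    """Attach an output-shape instruction and a hard output cap to a request.
    The dict layout imitates common chat APIs; field names may differ per provider."""
    shapes = {
        "patch": "Reply with a unified diff only. No prose.",
        "bullets": "Reply with at most 5 bullets. No preamble.",
        "diagnosis": "Reply with a one-line diagnosis, then the fix.",
    }
    return {
        "messages": [{"role": "user", "content": f"{prompt}\n\n{shapes[shape]}"}],
        "max_tokens": max_tokens,  # hard ceiling on paid output
    }
```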
5. Preprocess noisy inputs
Logs, PDFs, repository dumps, and generated files should be filtered before the model sees them. Keep the error, the surrounding lines, the relevant function, and the command that produced the output. Remove repeated stack traces, minified files, lockfile noise, and unrelated logs.
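The filtering step above can be a small pre-pass over the log: keep only lines that match an error keyword plus a little surrounding context, and drop everything else. The keyword list here is a starting assumption; tune it to your tooling's actual failure vocabulary.

```python
def trim_log(log: str, keywords=("error", "failed", "traceback"), context=2) -> str:
    """Keep only lines near a keyword match, plus `context` lines on each side."""
    lines = log.splitlines()
    keep = set()
    for i, line in enumerate(lines):
        if any(k in line.lower() for k in keywords):
            keep.update(range(max(0, i - context), min(len(lines), i + context + 1)))
    return "\n".join(lines[i] for i in sorted(keep))
```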
6. Stop runaway loops
Agentic workflows can spend heavily while stuck. If an agent has worked on the same error class for several turns without changing the state, pause the run. Read the diff, inspect the actual error, and redirect the task with tighter instructions.
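A minimal circuit breaker for this is to track an error signature per turn and pause once the last few turns look identical. The window size and the idea of a string "signature" are assumptions; in practice the signature might be the exception type plus the failing file.

```python
def should_pause(recent_errors: list[str], window: int = 3) -> bool:
    """Pause the agent when the last `window` turns produced the same error
    signature, i.e. the state has stopped changing."""
    if len(recent_errors) < window:
        return False
    tail = recent_errors[-window:]
    return len(set(tail)) == 1
```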
Claude Code workflow habits
Claude Code is powerful because it can work across a project, but that same strength makes context discipline important. Keep stable project rules in CLAUDE.md. Use short task boundaries. Compact between unrelated features. Clear history before switching from implementation to a different investigation.
Anthropic's Claude Code cost documentation describes commands such as /cost, context clearing, and compaction. Treat those as part of the workflow, not cleanup after something has already gone wrong.
For rate-limit pressure specifically, read the longer explainer: why Claude Code rate limits happen. If you want the runway visible while you work, the product page is here: Claude Code rate-limit tracker.
API and app workflows
In API products, token optimization usually comes from repeatability: stable prompt prefixes, smaller retrieved context, output limits, and model routing. OpenAI and Anthropic both document prompt caching behavior. The practical pattern is simple: keep shared instructions stable and put dynamic content late in the prompt so repeated prefixes are easier to reuse.
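The stable-prefix pattern is easy to enforce in code: keep the shared instructions in one constant that never changes between calls, and append everything dynamic afterward. The message layout below imitates a common chat-API shape and is a sketch, not any specific SDK's types.

```python
# Caching-friendly prompt assembly: identical prefix first, dynamic content last.
SYSTEM_RULES = "You are a code reviewer. Reply with a diff only."  # stable prefix

def build_messages(dynamic_context: str, question: str) -> list[dict]:
    return [
        {"role": "system", "content": SYSTEM_RULES},  # byte-identical on every call
        {"role": "user", "content": f"{dynamic_context}\n\n{question}"},  # varies
    ]
```

Because the first message is byte-identical across calls, providers that cache prompt prefixes have something to reuse.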
For retrieval systems, do not send every document chunk because it might be useful. Filter aggressively, deduplicate, and cap the amount of context per answer. For user-facing apps, measure cache hit rates and output length before changing model quality.
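Filtering and capping can happen in one pass over the ranked retrieval results. This sketch assumes the chunks arrive already sorted by relevance and uses a character budget as a stand-in for a token budget.

```python
def select_chunks(chunks: list[str], max_chars: int = 2000) -> list[str]:
    """Deduplicate retrieved chunks (assumed pre-sorted by relevance)
    and cap total context size instead of sending everything."""
    seen, out, used = set(), [], 0
    for chunk in chunks:
        key = chunk.strip()
        if key in seen:
            continue  # drop exact duplicates
        if used + len(chunk) > max_chars:
            break     # budget exhausted; lower-ranked chunks are dropped
        seen.add(key)
        out.append(chunk)
        used += len(chunk)
    return out
```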
If your costs span several providers, the broader entry points are AI spend tracker for Mac, AI budget tracker for developers, and OpenAI cost tracker.
Provider-specific playbooks
Claude Code: preserve state, remove noise
Claude Code is strongest when it has the right project context and weakest when a session carries too much stale history. Keep stable instructions in the project guide, ask for bounded diffs, use compaction when the session is still relevant, and clear context when switching to unrelated work. If rate pressure is rising, finish the current patch before asking for another broad investigation.
Cursor: avoid sending the whole workspace by habit
Cursor workflows can become expensive when every question drags in large files, generated output, or old terminal logs. Highlight only the relevant function, failing test, interface, or diff. Ask for a plan before edits when the task is ambiguous, then narrow the implementation request to the files that actually need to change.
OpenAI API: control output and repeated prefixes
For API work, reduce output tokens first because verbose responses compound quickly. Keep stable instructions stable, put dynamic content later, and cap response formats with schemas or explicit structures when possible. Use prompt caching where the provider supports it, but still measure cache hit rates instead of assuming the cache is doing useful work.
Gemini and research workflows: summarize before expanding
Research prompts often invite long context and long answers. Start with a narrow question, ask for source extraction or a short decision table, then expand only the section that matters. If a task needs broad reading, separate collection from synthesis so you do not repeatedly pay for the same context.
Prompt patterns that reduce token waste
Replace broad asks with scoped asks
Weak prompt: “Review this whole repo and improve it.” Better prompt: “Inspect the checkout flow files only, identify the highest-risk conversion issue, and propose a patch without touching unrelated modules.” The second prompt lowers context, reduces wandering, and gives the model a sharper success condition.
Ask for the output shape upfront
If you need a patch, ask for a patch. If you need a diagnosis, ask for three bullets and the next command. If you need a comparison, define the columns. Output shape is one of the simplest ways to reduce cost because every unnecessary explanation becomes paid output.
Use checkpoints for agentic work
For long coding tasks, ask the agent to stop after investigation, after the first patch, and after verification. Checkpoints prevent a model from turning uncertainty into expensive motion. They also make it easier to notice when a task needs a human decision instead of another automated loop.
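The checkpoint pattern can be sketched as a loop that stops for approval between phases. `run_phase` and `approve` are hypothetical stand-ins: one executes a bounded agent step, the other is the human or policy decision at the checkpoint.

```python
# Checkpointed agent loop sketch; phase names follow the text above.
PHASES = ["investigate", "first_patch", "verify"]

def run_with_checkpoints(run_phase, approve) -> list[str]:
    """Run each phase, then stop and ask before spending more tokens."""
    completed = []
    for phase in PHASES:
        run_phase(phase)
        completed.append(phase)
        if not approve(phase):  # checkpoint: human or policy decides
            break
    return completed
```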
Keep reusable context outside the chat when possible
Project rules, test commands, style decisions, and deployment notes belong in project files that the agent can reference deliberately. Repeating the same background in every prompt wastes tokens and increases the chance of stale instructions. A good repository guide often saves more tokens than a clever one-off prompt.
Tools and repos that actually help
Tool lists are only useful when they map to a workflow. These are worth knowing because they help you count, route, test, or package context more deliberately.
Primary references
This guide avoids advice that depends on unverifiable hacks such as gaming a usage window or avoiding alleged peak hours. The useful long-term tactics are the ones that survive changes in provider policy.