6 Ways I Cut My Claude Token Usage in Half!
If you're on the $20/month Claude plan, you absolutely need to implement these tips to reduce your Claude token consumption
In case you missed it:
DM me on INSTAGRAM to say hi :)
Friday night livestream moved to FRIDAYS 2PM EST!
I was mass-burning through my Claude subscription for WEEKS before I figured this out.
Not because I was doing anything fancy.
Because I was doing everything un-optimized.
Running Opus at max effort on every task. Letting my context window bloat to 80K tokens before I even typed a word. Feeding Claude my entire life story at every turn.
Here are the 6 things I changed.
All practical, all free, most <5 minutes.
1. Stop Using Opus for Everything
Type /model in Claude Code.
Then, switch models based on task complexity.
Opus for complex multi-file refactors, architecture decisions, debugging gnarly issues.
Sonnet for writing tests, simple edits, explaining code, most daily work.
Haiku for quick lookups, formatting, renaming, anything repetitive.
You don’t need a sports car to go get groceries.
This alone makes a HUGE difference.
Far more efficient model routing.
2. Clear Your Context Between Tasks
Every time you hit enter, Claude Code is shipping tons of system context before it even looks at what you typed.
And it snowballs.
The longer your session runs, the more bloated it gets. Slower responses, worse quality, higher cost.
You’re literally paying more to get dumber answers.
To fix this:
/clear between unrelated tasks. Wipe the slate, start fresh.
/compact when you’re about to start something big. It squeezes your conversation down to just the important parts.
3. Use CLI Tools Instead of MCP
If a CLI exists for a tool, use it over MCP. Faster AND cheaper.
GitHub is the perfect example. The gh CLI works better and eats WAY fewer tokens than the GitHub MCP server.
MCP tools inject their full schema into your context on BOTH sides… the tool definitions going in, the raw output coming back. You’re paying for all of it.
My rule of thumb:
CLI and Skills where possible.
MCP if there’s no alternative.
4. Install the Context-Mode Plugin
This open source project keeps raw MCP tool output from flooding your context window.
I use it daily. Runs in the background so I don’t have to proactively do anything.
It cuts MCP token usage by 50-90%.
The concept is simple. When an MCP tool returns 10,000 tokens of raw JSON, context-mode indexes it in a sandbox instead of dumping it into your conversation. You get a summary. Claude gets the info it needs. Your context stays clean.
Install it, configure it, done!!
Biggest bang for buck if you use loads of MCP servers.
5. Keep Your CLAUDE.md Lean
Your CLAUDE.md gets injected into EVERY single request.
Every turn. Every follow-up. Every /clear and fresh start.
If your CLAUDE.md is 5,000 tokens, you’re taxed 5,000 tokens on every interaction before Claude even reads your code!
Give it 5 rules & point it to the details when it needs them. Think “email with links”, not a “2000-page employee handbook”.
Keep it under 2,000 tokens.
Put the detailed stuff in separate files. Reference them with file paths. Claude reads them when it needs to.
Here’s a skeleton:
# CLAUDE.md

## Rules
- Use TypeScript strict mode
- Write tests for every new function
- Follow existing patterns in the codebase

## Key Files
- API routes: see src/api/README.md
- Database schema: see docs/schema.md
- Style guide: see docs/style-guide.md
3 rules.
3 file pointers.
Under 500 tokens.
Claude reads those linked files ONLY when it’s working on something relevant.
NOT every single turn.
BIG saving if you have a super bloated CLAUDE.md file (guilty!)
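Want to sanity-check your file against that budget? A rough heuristic is ~4 characters per token. It's an approximation, not the real tokenizer, but it's plenty for a ballpark check:

```shell
# Rough token estimate for a file, assuming ~4 characters per token.
# Heuristic only -- good enough to tell "500 tokens" from "5,000".
estimate_tokens() {
  chars=$(wc -c < "$1")
  echo $(( chars / 4 ))
}

# Example: estimate_tokens CLAUDE.md
```

If `estimate_tokens CLAUDE.md` comes back north of 2,000, start moving detail out into linked files.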
6. Run Claude Code for FREE With Ollama
$0. Forever.
Everything runs local on your own machine, no API key needed, no subscription, no usage caps.
Step 1: Install Ollama
Head to ollama.com and grab the installer. It takes about 30 seconds on Mac, Windows, or Linux.
Step 2: Pull a coding model
Open your terminal and run:
ollama pull qwen3-coder
This grabs a 30B parameter model with a 128K context window. As of right now, it’s the strongest free option for coding tasks.
Which model fits your machine:
- 16GB+ RAM: qwen3-coder is your best bet
- 8-16GB RAM: go with devstral-2-small (24B), still very capable
- 8GB or less: granite3.3:8b (8B) will run, but you’ll feel the difference
If you’re on Apple Silicon, the unified memory architecture handles 24B+ models without breaking a sweat.
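The RAM guidance above boils down to a tiny helper (model names are the ones from the list; the thresholds are just my rules of thumb, so adjust for your machine):

```shell
# Map available RAM (in GB) to the suggested Ollama model from the list above.
pick_model() {
  if [ "$1" -ge 16 ]; then
    echo "qwen3-coder"        # 30B, strongest free coding option
  elif [ "$1" -gt 8 ]; then
    echo "devstral-2-small"   # 24B, still very capable
  else
    echo "granite3.3:8b"      # 8B, runs anywhere but you'll feel it
  fi
}

pick_model 32   # -> qwen3-coder
```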
Step 3: Start the Ollama server
ollama serve
Leave this running in a terminal tab.
Step 4: Point Claude Code at Ollama
Set 2 environment variables before launching:
export ANTHROPIC_BASE_URL=http://localhost:11434/v1
export ANTHROPIC_MODEL=qwen3-coder
Then run claude like you normally would.
Instead of hitting Anthropic’s servers, it talks to your local model.
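If you don’t want those variables leaking into every Claude session in your shell, a small launcher script keeps them scoped to one process. A sketch (the filename is hypothetical; it assumes `claude` is on your PATH):

```shell
#!/usr/bin/env sh
# claude-local.sh (hypothetical name): run Claude Code against local Ollama.
# The overrides only exist for this one process, not your whole shell.
export ANTHROPIC_BASE_URL=http://localhost:11434/v1
export ANTHROPIC_MODEL=qwen3-coder
exec claude "$@"
```

`chmod +x claude-local.sh`, then run it whenever you want the local model; plain `claude` still hits your subscription.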
Step 5: Know the limits
These open source models are good enough for writing, research, summarizing, and simpler tasks.
But for serious technical work… I wouldn’t do it. I’ve had models tell me they edited a file when literally nothing changed.
You also lose web search (you can add it back with a Brave or Tavily MCP server), there’s no prompt caching so every turn reprocesses your full context, and the gap between these models and Opus is quite noticeable on hard problems.
My recommendation:
Let Ollama handle easier stuff.
Point your Claude subscription at tough work.
You’d be surprised how much of your daily usage falls into the “easy stuff” bucket anyway!
The pattern behind all 6 tips is the same…
Stop paying premium prices for tasks that don’t need it.
You don’t need to drive the Ferrari to the grocery store :D
(or something like that…)
Match the tool to the job.
Keep your context clean.
Don’t feed Claude more than it needs.
Start with a few tips and you’ll notice the difference TODAY.
Then experiment with others 1 at a time.
Your Claude bill will thank you :)
If you LOVE this newsletter, please SHARE it to help teach more people for FREE!
What should I do next?
P.S. Need More Help? 👋
1/ Free AI courses
2/ Free AI prompts
3/ Free AI automations
4/ Free AI vibe coding
5/ Ask me anything @ Friday livestream
6/ Free private community for Women Building AI
7/ I built Blotato to grow 1M+ followers in 1 year
8/ If you want AI speakers/consultants/coaching, REPLY with your project & budget. I will refer you (I make $0 zero money).