Blog › Token Economy as Cash Economy: Treating Model Selection as a Staffing Decision

AI Workflow

Token Economy as Cash Economy: Treating Model Selection as a Staffing Decision

Teh Kim GuanACMA · CGMA

2026-05-13 · 8 min read

Token Economy as Cash Economy: Treating Model Selection as a Staffing Decision

The Reframe

Most AI workflow advice talks about model selection as a technical choice. Use the bigger model for harder problems. Use the smaller model when speed or cost matters. The advice is correct and almost useless because it gives you no rule for the in-between cases, which is most cases.

A better frame: treat the model tier as a staffing decision in your consultancy. You already make staffing decisions every week. You already know that putting a senior partner on a junior task wastes money, and putting a junior on a senior task wastes the client's confidence. The same logic applies to model selection, with the convenient property that the seniority tiers are cleaner and the rate cards are public.

The Three Tiers

Three-tier AI model staffing framework: Opus 4.7 as Senior Partner for strategy work, Sonnet 4.6 in parallel as Mid-level Consultant for production, and Scripts as Junior Coordinator for mechanical tasks

Model tier	Equivalent role	Rate card analogue	Where it earns its place
Opus 4.7	Senior partner	RM 800 to RM 1,500 per hour	Strategy, methodology, judgment-heavy synthesis, voice tuning, high-stakes single deliverables
Sonnet 4.6 (parallel subagents)	Mid-level consultant	RM 300 to RM 500 per hour	Templated production work, parallel execution, prompt preparation, draft writing against a clear brief
Scripts and lean orchestrators	Junior coordinator or operations support	RM 80 to RM 150 per hour	Manifest building, file routing, em-dash scrubbing, format conversion, mechanical loops

The frame collapses three separate decisions (which model, what concurrency, what fallback) into one consistent question: who would I staff this with if it was a human team?

The Specific Discipline

In the consulting workflow I ran recently, the discipline played out as follows.

Strategy and pillar articles got Opus 4.7. The topical cluster blueprint required depth, the voice spec required nuance, and the cornerstone articles required Opus's better long-context reasoning. Estimated spend: about RM 600 across all strategic work. This is the senior partner setting the engagement direction. You do not skimp here.

Use-case articles and image briefs got Sonnet 4.6 in parallel batches of five to ten subagents. The work is templated. Each subagent receives the schema, the voice anchors, and an explicit batch boundary. Quality came in at 85% to 95% of Opus output, with cost reductions of roughly 90%. This is the mid-level consultant team executing against a partner-set brief. They do the volume work and they do it well.

Plumbing got scripts. Manifest building, em-dash scrubbing, file routing, and format conversion are mechanical operations. They do not require reasoning. They require correctness and idempotency. Total LLM spend on plumbing: zero. This is the operations layer. You do not pay an LLM to do work that a thirty-line Python script can do faster and more reliably.

The Mistake I Made

The single biggest cost leak in last week's engagement was leaving the parent orchestrator on Opus 4.7 during the bulk firing phase. The orchestration logic is mechanical: read JSON, fire calls, record results. Opus's reasoning depth was not consumed. Each operation echoed the prompt back into context, and Opus context is the most expensive place to store these echoes. Estimated waste: about USD 15 of unnecessary spend on what should have been a Sonnet or script operation.

The staffing analogue makes the mistake obvious. I had a senior partner manning the photocopier. The partner did the job correctly, but the rate was wrong for the work. No consultancy would tolerate that for an afternoon, let alone for an entire orchestration phase.

The fix for the next engagement: switch the parent session to Sonnet 4.6 for the firing phase, or move the firing phase to a Python script that bypasses the agent entirely. Either correction saves roughly 60% on that phase and frees the parent context for higher-value work.

Why the Frame Works

Three properties make the staffing analogy useful rather than cute.

The rate cards are real. Opus 4.7 1M Extra High costs roughly USD 30 per million input tokens and USD 150 per million output tokens. Sonnet 4.6 costs roughly USD 3 per million input and USD 15 per million output. The 10x ratio is comparable to senior-to-mid consultant rate differentials in Malaysia. Decisions made on the LLM rate card transfer cleanly to decisions about human staffing, and vice versa.

The cognitive demand maps cleanly. Strategy work demands more depth than templated production work, in both human and LLM contexts. The match is not metaphorical. The same underlying property (ability to hold and integrate broader context, willingness to engage with ambiguity, capacity to challenge inputs) determines who succeeds at the work, regardless of whether the worker is human or model.

The judgment moves are similar. When you decide whether to put a senior partner on a task, you ask: would the partner's depth meaningfully change the outcome, or would the partner just be doing what a mid-level can do? The same question applies to Opus vs Sonnet. If the answer is "the depth would not change the outcome," you have a Sonnet task and you are wasting money on Opus.

The Operating Rule

Translate this into a single rule you can apply in real time.

Match the model to the cognitive demand of the work, not to the importance of the engagement.

Important engagements often have low-cognitive-demand sub-stages. A high-stakes valuation report has strategic methodology selection (Opus territory), parallel evidence gathering (Sonnet parallel), and table assembly (script). All three live inside the same engagement. The same rule produces different model choices at different stages.

The mistake to avoid is the inverse: assigning the model based on engagement importance. "This is a major client, so I will use Opus throughout." The result is paying senior partner rates for work that was always going to be mid-level or junior in nature. Importance does not change the work. The work is what it is.

What This Means for Pricing

If you accept the staffing analogy, the pricing implication is direct.

A traditional consulting engagement is priced on time-and-materials at blended rates. If a partner runs the engagement for sixty hours at RM 1,000 per hour, the bill is RM 60,000 plus expenses, and the firm captures roughly 30% to 50% margin after delivery costs.

An AI-augmented engagement using the staffing analogy properly might consume:

4 hours of Opus 4.7 strategic work at, say, RM 50 of token spend
30 hours of Sonnet 4.6 parallel production work at, say, RM 80 of token spend
20 hours of script execution at, say, RM 5 of compute spend
10 hours of senior human time on judgment, client interaction, quality gates

Total token and compute spend: roughly RM 135. Total senior human time: 10 hours. If the engagement is priced anywhere between RM 18,000 (productised tier) and RM 45,000 (enterprise tier), the gross margin sits at 95% or higher.

The ethical question is whether to price at cost-plus (which would imply RM 1,000 to RM 5,000 per engagement) or at value (which would imply RM 18,000 to RM 45,000 per engagement). My current view is that value pricing is correct because the deliverable's worth to the client is what it is, regardless of whether you produced it with a team of fifteen or with a senior consultant and three model tiers. The client is buying the outcome.

The Discipline to Hold

Two disciplines are easy to drift on and worth holding.

Re-evaluate model assignments at every workflow stage, not at the engagement start. It is tempting to set up the engagement on Opus and stay there. The discipline is to ask, before each stage, whether the cognitive demand has dropped or risen. If it dropped, downshift. If it rose, upshift. Default-Opus is the default-failure mode.

Match concurrency to the parallelisability of the work, not to the cost saving you want. Five Sonnet subagents in parallel is the right answer when the work splits cleanly. Five Sonnet subagents in parallel is the wrong answer when the work has sequential dependencies that the dispatch hides. The parallelism has to fit the work shape. This is where the Brief-then-Fire architecture earns its keep.

Closing

The staffing analogy is not a trick. It is a recognition that LLM workflows have the same structural properties as consulting team workflows: mixed cognitive demands, mixed concurrency profiles, mixed cost tiers, mixed quality stakes. The disciplines that built effective consulting teams (right person on right task, parallel execution where work allows, ruthless about not wasting senior time on junior work) port over directly.

The 95% margin economics in AI-augmented consulting are not a function of the AI being magic. They are a function of staffing the workflow correctly. Senior on strategy. Mid-level on parallel production. Junior or scripts on plumbing. Same rule that built every functional consulting practice in history. New tools. Same arithmetic.

About the Author

Teh Kim Guan

Product Consultant · General Manager, PEPS Ventures

Strategy and technology are the same decision. Over 15 years in fintech (CTOS, D&B), prop-tech (PropertyGuru DataSense), and digital startups, I have built frameworks that help founders and executives make both moves at once. Based in Kuala Lumpur.

Working on a 0→1 product?

I help founders and operators go from idea to validated product. Let's talk about yours.

Get in touch →