📦 Gary Redesign
Flexible + Validated Architecture
v2 – Updated
January 30, 2026 • Prepared by Max
📋 Executive Summary
Gary (the Google Ads Slack bot) was hallucinating numbers. After analysis, we identified four root causes. This v2 proposal keeps the agent flexible while adding a validation layer, avoiding the overcorrection of rigid, fixed tools.
❌ Current State
- LLM writes GAQL queries (sometimes wrong source)
- LLM does arithmetic in-context (error-prone)
- No verification before output
- Context scattered across 6+ files
- Missing API credentials
✅ Proposed State
- LLM keeps flexibility to write queries
- Arithmetic via a dedicated tool (the LLM can't botch it)
- Validator sub-agent checks everything
- Annotations show confidence level
- Evals catch regressions over time
v2 Change: Why Not Rigid Tools?
The v1 proposal suggested 10 fixed query templates. That's too constraining: users ask novel questions, and we'd constantly be adding new tools. Instead:
- Keep flexibility – LLM can still handle any question
- Add guardrails – Validator catches common mistakes
- Annotate, don't block – User sees confidence level
- Build evals – Learn from mistakes over time
Simplified: No Separate Bot Needed
Gary was a separate Slack bot process that shelled out to Claude CLI. That's obsolete now: OpenClaw already has native Slack access.
- Before: Slack → slack_bot.py → subprocess(claude --print) → parse response → Slack
- After: Slack → OpenClaw → response → Slack
No subprocess hacks. No "recover_response_from_session" workarounds. Just native integration.
🔍 Root Cause Analysis
Critical: No Verification Loop
Gary's architecture: Slack → Claude CLI → Response. If Claude hallucinates, nothing catches it.
Evidence from LEARNINGS.md:
User caught an error: Gary reported a CPA of $74.27, but $643.55 ÷ 11 = $58.50
→ Gary queried correctly but botched the arithmetic
High: Context Sprawl
The same rules appear in 6+ locations with subtle differences. Working from the projects/ folder means missing LEARNINGS.md – the file with the corrections!
High: keyword_view vs search_term_view Confusion
Called out six times in the docs as "the #1 source of wrong numbers." The validator will explicitly check for this, as sketched below.
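A minimal sketch of the shape of that check – the intent phrases, function signature, and return convention are illustrative assumptions, not the final implementation:

```python
# Hypothetical heuristic: map question intent to the GAQL resource it must
# read from. keyword_view reports on the keywords we bid on;
# search_term_view reports on what users actually typed.
EXPECTED_SOURCE = {
    "keyword performance": "keyword_view",
    "search term": "search_term_view",
}

def verify_query_source(intent: str, gaql: str) -> str | None:
    """Return a red-flag annotation if the query reads from the wrong view."""
    for phrase, view in EXPECTED_SOURCE.items():
        if phrase in intent.lower() and view not in gaql:
            return f"🔴 Query-source mismatch: '{phrase}' questions should query {view}"
    return None  # no mismatch detected
```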
Critical: Missing API Credentials

| Service | Status | Missing |
| --- | --- | --- |
| Google Ads | ❌ Incomplete | Developer token, OAuth, refresh token |
| Google Sheets | ⚠️ Partial | gspread not installed, sheets not shared |
| Google Docs | ❌ None | Everything |
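For Phase 1, a .env.example along these lines would cover the gaps. The GOOGLE_ADS_* names follow the google-ads Python client's environment-variable convention; confirm against how scripts/client.py actually loads credentials:

```
# Google Ads API
GOOGLE_ADS_DEVELOPER_TOKEN=
GOOGLE_ADS_CLIENT_ID=
GOOGLE_ADS_CLIENT_SECRET=
GOOGLE_ADS_REFRESH_TOKEN=
GOOGLE_ADS_LOGIN_CUSTOMER_ID=

# Google Sheets / Docs (service account key, symlinked into the project)
GOOGLE_APPLICATION_CREDENTIALS=./service-account.json
```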
🏗️ Proposed Architecture: Validator Sub-Agent
The key insight: Keep the main agent flexible, but run every response through a skeptical validator before sending.
```
┌────────────────────────────────────────────────────────────────
│  USER QUESTION
│  "How is the ai assistant keyword doing?"
└────────────────────────────────────────────────────────────────
                               ↓
┌────────────────────────────────────────────────────────────────
│  OPENCLAW (flexible, native Slack)
│
│  • Understands the question
│  • Writes appropriate GAQL query
│  • Executes query, gets results
│  • Calls arithmetic tool for any calculations
│  • Drafts response
│
│  Output: { response, raw_query, raw_results, calculations }
└────────────────────────────────────────────────────────────────
                               ↓
┌────────────────────────────────────────────────────────────────
│  VALIDATOR SUB-AGENT
│
│  Prompt: "You are a skeptical reviewer. Your job is to catch
│  errors before they reach the user. Check everything."
│
│  Tools:
│  • verify_query_source(intent, gaql)   → right data source?
│  • verify_arithmetic(expr, result)     → recalculate independently
│  • verify_consistency(data, response)  → numbers match?
│  • check_sample_size(n)                → warn if too small
│
│  Output: Annotations (does NOT block, only annotates)
└────────────────────────────────────────────────────────────────
                               ↓
┌────────────────────────────────────────────────────────────────
│  FINAL OUTPUT
│
│  Response + Annotations → User
│
│  Example:
│  "The 'ai assistant' keyword has 45 clicks, 3 QSUs, $58.50 CPA"
│
│  ✅ Verified   ⚠️ Small sample (n=3)
└────────────────────────────────────────────────────────────────
```
Why a Sub-Agent?
- Independent reasoning – can't be "convinced" by the main agent's logic
- Different prompt – skeptical, adversarial, focused on catching errors
- Own tools – specialized for verification (arithmetic checker, query auditor)
- Parallel development – the validator can improve without touching the main agent
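A minimal sketch of the wiring, assuming hypothetical main_agent and validator callables; the draft payload shape comes from the diagram above:

```python
from typing import Callable

def handle_question(
    question: str,
    main_agent: Callable[[str], dict],       # flexible agent: returns draft payload
    validator: Callable[[dict], list[str]],  # sub-agent: returns annotation strings
) -> str:
    """One pass through the draft → validate → annotate flow."""
    # Main agent understands, queries, calls the arithmetic tool, drafts.
    draft = main_agent(question)
    # draft = {"response": ..., "raw_query": ..., "raw_results": ...,
    #          "calculations": [...]}

    # Validator reviews independently: it sees the raw query and results,
    # not just the prose, so it can't be talked into the same mistake.
    annotations = validator(draft)

    # Annotate, don't block: the response always ships, marks attached.
    return draft["response"] + "\n\n" + "\n".join(annotations)
```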
Validator Tools
| Tool | Purpose | Example Check |
| --- | --- | --- |
| verify_query_source() | Right data source for the question? | "keyword performance" → must use keyword_view |
| verify_arithmetic() | Recalculate independently | $643.55 ÷ 11 = ? (catch the $74.27 error) |
| verify_consistency() | Numbers in response match raw data? | Response says "45 clicks" → is that in the query result? |
| check_sample_size() | Warn if conclusions come from tiny data | n=3 conversions → add warning |
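The two simplest tools could be as small as this sketch – the signatures and the n < 10 threshold are assumptions; real versions would live under validator/tools/:

```python
def verify_arithmetic(a: float, b: float, claimed: float) -> str:
    """Recalculate a ÷ b independently and compare to the claimed result."""
    actual = a / b
    if abs(actual - claimed) > 0.005:  # tolerance for cent rounding
        return f"🔴 Arithmetic mismatch: {a} ÷ {b} = {actual:.2f}, not {claimed:.2f}"
    return f"✅ Arithmetic verified ({a} ÷ {b} = {actual:.2f})"

def check_sample_size(n: int, threshold: int = 10) -> str | None:
    """Warn when a conclusion rests on tiny data."""
    return f"⚠️ Small sample size (n={n})" if n < threshold else None

# The LEARNINGS.md error would be caught like so:
# verify_arithmetic(643.55, 11, 74.27)
# → "🔴 Arithmetic mismatch: 643.55 ÷ 11 = 58.50, not 74.27"
```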
Annotation Types
Response: "The 'ai assistant' keyword has 45 clicks, 3 QSU conversions, and a CPA of $58.50 over the last 7 days. Performance is within target."
Annotations:
โ
Query source verified (keyword_view)
โ
Arithmetic verified ($175.50 รท 3 = $58.50)
โ ๏ธ Small sample size (n=3)
Response: "Search term 'lindy ai' drove 12 conversions at $42 CPA."
Annotations:
โ
Query source verified (search_term_view)
๐ด Arithmetic mismatch: $504 รท 12 = $42.00, but query shows $38.50
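One way to represent these annotations in code – the class name and severity levels are assumptions inferred from the examples above:

```python
from dataclasses import dataclass

MARKS = {"verified": "✅", "warning": "⚠️", "critical": "🔴"}

@dataclass
class Annotation:
    severity: str  # "verified" | "warning" | "critical"
    message: str

    def render(self) -> str:
        """Render as a Slack-ready line, e.g. '🔴 Arithmetic mismatch: ...'."""
        return f"{MARKS[self.severity]} {self.message}"
```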
Arithmetic Tool (Required for Main Agent)
The main agent cannot do math in-context. It must call the arithmetic tool:
```python
# Instead of the agent writing "The CPA is $643.55 / 11 = $74.27" (WRONG),
# it calls:
calculate(
    operation="divide",
    a=643.55,
    b=11,
    label="CPA",
)
# Returns: {"result": 58.50, "label": "CPA", "expression": "643.55 ÷ 11"}
```
This ensures arithmetic is always correct, and the validator can verify the tool was used.
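A minimal sketch of what tools/arithmetic.py could look like behind that call; the operations beyond divide are assumptions:

```python
# All math happens here, deterministically – never in the LLM's context.
OPS = {
    "add": lambda a, b: a + b,
    "subtract": lambda a, b: a - b,
    "multiply": lambda a, b: a * b,
    "divide": lambda a, b: a / b,
}
SYMBOLS = {"add": "+", "subtract": "-", "multiply": "×", "divide": "÷"}

def calculate(operation: str, a: float, b: float, label: str = "") -> dict:
    """Perform one operation and return a structured, auditable result."""
    result = round(OPS[operation](a, b), 2)
    return {
        "result": result,
        "label": label,
        "expression": f"{a} {SYMBOLS[operation]} {b}",
    }

# calculate("divide", a=643.55, b=11, label="CPA")
# → {"result": 58.5, "label": "CPA", "expression": "643.55 ÷ 11"}
```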
📈 Evals Roadmap
Once the validator is running, we build a test suite from real failures:
Eval Categories
| Category | What We Test | Source |
| --- | --- | --- |
| Query Source | keyword_view vs search_term_view selection | Historical confusions |
| Arithmetic | CPA, totals, percentages | LEARNINGS.md errors |
| Consistency | Numbers in response match query | User-reported issues |
| Edge Cases | Zero conversions, missing data | Synthetic tests |
| Warnings | Should trigger sample size warning? | Defined thresholds |
Process:
- Every time we catch an error, add it to the eval suite (one such case is sketched below)
- Run evals on validator changes
- Track accuracy over time
- Evals become the spec for validator behavior
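A sketch of one fixture and a minimal runner – the fixture shape is an assumption, and verify_arithmetic is the sketch from the validator section above:

```python
# One case for evals/cases/from_learnings.json, shown as a Python dict.
CASE = {
    "name": "cpa_division_from_learnings",
    "source": "LEARNINGS.md",
    "input": {"a": 643.55, "b": 11, "claimed": 74.27},
    "expect": "mismatch",  # the validator must flag this one
}

def run_case(case: dict) -> bool:
    """Pass iff the validator flags exactly when the case expects it to."""
    annotation = verify_arithmetic(**case["input"])
    flagged = annotation.startswith("🔴")
    return flagged == (case["expect"] == "mismatch")

assert run_case(CASE), CASE["name"]
```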
📁 Target File Structure
The proposed project structure, organized by responsibility:
```
google-ads/
├── .env                      # API credentials (create from .env.example)
├── .env.example              # Template with required vars
├── README.md                 # Project overview
│
├── .claude/                  # 🔗 Symlink to source repo's .claude/
│   ├── CLAUDE.md             # Main context file
│   ├── LEARNINGS.md          # Corrections – READ FIRST!
│   ├── GOALS.md              # Mission and targets
│   ├── CONTEXT.md            # Competitors, account info
│   ├── REPORT_STYLE.md       # Formatting guide
│   └── skills/
│       └── google-ads-query/
│           └── SKILL.md      # Full query patterns
│
├── tools/                    # Agent tools
│   ├── __init__.py
│   ├── arithmetic.py         # calculate() – all math goes here
│   ├── query.py              # GAQL query execution
│   └── sheets.py             # Google Sheets integration
│
├── validator/                # Validator sub-agent
│   ├── agent.py              # Sub-agent orchestration
│   ├── prompt.md             # Skeptical reviewer prompt
│   └── tools/
│       ├── __init__.py
│       ├── verify_arithmetic.py    # Recalculate independently
│       ├── verify_query_source.py  # keyword_view vs search_term_view
│       ├── verify_consistency.py   # Numbers match raw data?
│       └── check_sample_size.py    # Warn on small n
│
├── evals/                    # Test suite
│   ├── run_evals.py          # Eval runner
│   ├── test_arithmetic.py    # Arithmetic error cases
│   ├── test_query_source.py  # Data source confusion cases
│   ├── test_consistency.py   # Response/data mismatch cases
│   └── cases/                # JSON test fixtures
│       ├── arithmetic_errors.json
│       ├── query_source.json
│       └── from_learnings.json     # Converted from LEARNINGS.md
│
├── scripts/                  # 🔗 Existing scripts (symlinked)
│   └── client.py             # Google Ads API client
│
└── service-account.json      # 🔗 Google service account (symlinked)
```
Key Changes from Current Structure
| Current | Proposed | Why |
| --- | --- | --- |
| .context/ (simplified copies) | .claude/ (symlink to source) | Single source of truth, includes LEARNINGS.md |
| Math in LLM context | tools/arithmetic.py | Deterministic, testable, can't hallucinate |
| No validation | validator/ sub-agent | Catches errors before they reach the user |
| No tests | evals/ suite | Learn from mistakes, prevent regressions |
| Separate Slack bot process | OpenClaw handles Slack directly | No subprocess, no CLI hacks – native integration |
✅ Action Plan
Phase 1: Foundation (This Week)
- Set up Google Ads API credentials (developer token, OAuth flow)
- Create .env file with all required credentials
- Fix context split-brain: symlink .claude/ folder
- Install gspread, share target sheets with service account
Phase 2: Core Tools (Next Week)
- Build calculate() arithmetic tool
- Build validator sub-agent with initial prompt
- Implement verify_arithmetic()
- Implement verify_query_source()
- Implement verify_consistency()
Phase 3: Integration (Week 3)
- Wire validator sub-agent into response flow
- Add annotation formatting for Slack output
- Build initial eval suite (10-20 cases)
- Add check_sample_size() warnings
Phase 4: Ryan's Report (Week 4)
- Get Ryan's current report format
- Map report sections to query patterns
- Build report generator with validation
- Expand eval suite based on report needs
📊 Success Metrics
| Metric | Current | Target |
| --- | --- | --- |
| Arithmetic errors reaching user | ~10% | 0% (caught by validator) |
| Wrong data source | Unknown | 0% (validator checks) |
| Responses with annotations | 0% | 100% |
| Eval suite coverage | 0 cases | 50+ cases |
| Can replicate Ryan's report | No | Yes |
🎯 Design Principles
Flexibility with Guardrails
The main agent stays flexible – it can handle novel questions. The validator adds a safety net without constraining what questions we can answer.
Annotate, Don't Block
The validator adds transparency, not friction. Users see confidence levels. Only critical errors (like arithmetic mismatches) get flagged prominently.
Learn from Mistakes
Every error becomes an eval case. The system gets smarter over time. Evals are the spec – if we can't test it, we can't guarantee it.
Separation of Concerns
Main agent: understanding + querying + narrating. Arithmetic tool: all calculations. Validator: all verification. Each piece can improve independently.