📦 Gary Redesign
Flexible + Validated Architecture
v2 – Updated
January 30, 2026 • Prepared by Max
📋 Executive Summary
Gary (the Google Ads Slack bot) was hallucinating numbers. After analysis, we identified four root causes. This v2 proposal keeps the agent flexible while adding a validation layer, avoiding the overcorrection of rigid, fixed tools.
❌ Current State
- LLM writes GAQL queries (sometimes wrong source)
- LLM does arithmetic in-context (error-prone)
- No verification before output
- Context scattered across 6+ files
- Missing API credentials
✅ Proposed State
- LLM keeps flexibility to write queries
- Arithmetic via a dedicated tool (the LLM can't botch it)
- Validator sub-agent checks everything
- Annotations show confidence level
- Evals catch regressions over time
v2 Change: Why Not Rigid Tools?
The v1 proposal suggested 10 fixed query templates. That's too constraining: users ask novel questions, and we'd constantly be adding new tools. Instead:
- Keep flexibility – LLM can still handle any question
- Add guardrails – Validator catches common mistakes
- Annotate, don't block – User sees confidence level
- Build evals – Learn from mistakes over time
Simplified: No Separate Bot Needed
Gary was a separate Slack bot process that shelled out to Claude CLI. That's obsolete now: OpenClaw already has native Slack access.
- Before: Slack → slack_bot.py → subprocess(claude --print) → parse response → Slack
- After: Slack → OpenClaw → response → Slack
No subprocess hacks. No "recover_response_from_session" workarounds. Just native integration.
🔍 Root Cause Analysis
Critical: No Verification Loop
Gary's architecture: Slack → Claude CLI → Response. If Claude hallucinates, nothing catches it.
Evidence from LEARNINGS.md:
User caught an error: Gary reported a CPA of $74.27, but $643.55 ÷ 11 = $58.50
→ Gary queried correctly but botched the arithmetic
High: Context Sprawl
The same rules appear in 6+ locations with subtle differences. Working from the projects/ folder means missing LEARNINGS.md – the file with the corrections!
High: keyword_view vs search_term_view Confusion
Called out six times in the docs as "the #1 source of wrong numbers." The validator will explicitly check for this, as sketched below.
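A minimal sketch of the shape of that check – the intent phrases, function signature, and return convention are illustrative assumptions, not the final implementation:

```python
# Hypothetical heuristic: map question intent to the GAQL resource it must
# read from. keyword_view reports on the keywords we bid on;
# search_term_view reports on what users actually typed.
EXPECTED_SOURCE = {
    "keyword performance": "keyword_view",
    "search term": "search_term_view",
}

def verify_query_source(intent: str, gaql: str) -> str | None:
    """Return a red-flag annotation if the query reads from the wrong view."""
    for phrase, view in EXPECTED_SOURCE.items():
        if phrase in intent.lower() and view not in gaql:
            return f"🔴 Query-source mismatch: '{phrase}' questions should query {view}"
    return None  # no mismatch detected
```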
Critical: Missing API Credentials

| Service | Status | Missing |
| --- | --- | --- |
| Google Ads | ❌ Incomplete | Developer token, OAuth, refresh token |
| Google Sheets | ⚠️ Partial | gspread not installed, sheets not shared |
| Google Docs | ❌ None | Everything |
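For Phase 1, a .env.example along these lines would cover the gaps. The GOOGLE_ADS_* names follow the google-ads Python client's environment-variable convention; confirm against how scripts/client.py actually loads credentials:

```
# Google Ads API
GOOGLE_ADS_DEVELOPER_TOKEN=
GOOGLE_ADS_CLIENT_ID=
GOOGLE_ADS_CLIENT_SECRET=
GOOGLE_ADS_REFRESH_TOKEN=
GOOGLE_ADS_LOGIN_CUSTOMER_ID=

# Google Sheets / Docs (service account key, symlinked into the project)
GOOGLE_APPLICATION_CREDENTIALS=./service-account.json
```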
🏗️ Proposed Architecture: Validator Sub-Agent
The key insight: Keep the main agent flexible, but run every response through a skeptical validator before sending.
```
┌────────────────────────────────────────────────────────────────
│  USER QUESTION
│  "How is the ai assistant keyword doing?"
└────────────────────────────────────────────────────────────────
                               ↓
┌────────────────────────────────────────────────────────────────
│  OPENCLAW (flexible, native Slack)
│
│  • Understands the question
│  • Writes appropriate GAQL query
│  • Executes query, gets results
│  • Calls arithmetic tool for any calculations
│  • Drafts response
│
│  Output: { response, raw_query, raw_results, calculations }
└────────────────────────────────────────────────────────────────
                               ↓
┌────────────────────────────────────────────────────────────────
│  VALIDATOR SUB-AGENT
│
│  Prompt: "You are a skeptical reviewer. Your job is to catch
│  errors before they reach the user. Check everything."
│
│  Tools:
│  • verify_query_source(intent, gaql)   → right data source?
│  • verify_arithmetic(expr, result)     → recalculate independently
│  • verify_consistency(data, response)  → numbers match?
│  • check_sample_size(n)                → warn if too small
│
│  Output: Annotations (does NOT block, only annotates)
└────────────────────────────────────────────────────────────────
                               ↓
┌────────────────────────────────────────────────────────────────
│  FINAL OUTPUT
│
│  Response + Annotations → User
│
│  Example:
│  "The 'ai assistant' keyword has 45 clicks, 3 QSUs, $58.50 CPA"
│
│  ✅ Verified   ⚠️ Small sample (n=3)
└────────────────────────────────────────────────────────────────
```
Why a Sub-Agent?
- Independent reasoning – can't be "convinced" by the main agent's logic
- Different prompt – skeptical, adversarial, focused on catching errors
- Own tools – specialized for verification (arithmetic checker, query auditor)
- Parallel development – the validator can improve without touching the main agent
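A minimal sketch of the wiring, assuming hypothetical main_agent and validator callables; the draft payload shape comes from the diagram above:

```python
from typing import Callable

def handle_question(
    question: str,
    main_agent: Callable[[str], dict],       # flexible agent: returns draft payload
    validator: Callable[[dict], list[str]],  # sub-agent: returns annotation strings
) -> str:
    """One pass through the draft → validate → annotate flow."""
    # Main agent understands, queries, calls the arithmetic tool, drafts.
    draft = main_agent(question)
    # draft = {"response": ..., "raw_query": ..., "raw_results": ...,
    #          "calculations": [...]}

    # Validator reviews independently: it sees the raw query and results,
    # not just the prose, so it can't be talked into the same mistake.
    annotations = validator(draft)

    # Annotate, don't block: the response always ships, marks attached.
    return draft["response"] + "\n\n" + "\n".join(annotations)
```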
Validator Tools
| Tool | Purpose | Example Check |
| --- | --- | --- |
| verify_query_source() | Right data source for the question? | "keyword performance" → must use keyword_view |
| verify_arithmetic() | Recalculate independently | $643.55 ÷ 11 = ? (catch the $74.27 error) |
| verify_consistency() | Numbers in response match raw data? | Response says "45 clicks" → is that in the query result? |
| check_sample_size() | Warn if conclusions come from tiny data | n=3 conversions → add warning |
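The two simplest tools could be as small as this sketch – the signatures and the n < 10 threshold are assumptions; real versions would live under validator/tools/:

```python
def verify_arithmetic(a: float, b: float, claimed: float) -> str:
    """Recalculate a ÷ b independently and compare to the claimed result."""
    actual = a / b
    if abs(actual - claimed) > 0.005:  # tolerance for cent rounding
        return f"🔴 Arithmetic mismatch: {a} ÷ {b} = {actual:.2f}, not {claimed:.2f}"
    return f"✅ Arithmetic verified ({a} ÷ {b} = {actual:.2f})"

def check_sample_size(n: int, threshold: int = 10) -> str | None:
    """Warn when a conclusion rests on tiny data."""
    return f"⚠️ Small sample size (n={n})" if n < threshold else None

# The LEARNINGS.md error would be caught like so:
# verify_arithmetic(643.55, 11, 74.27)
# → "🔴 Arithmetic mismatch: 643.55 ÷ 11 = 58.50, not 74.27"
```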
Annotation Types
Response: "The 'ai assistant' keyword has 45 clicks, 3 QSU conversions, and a CPA of $58.50 over the last 7 days. Performance is within target."
Annotations:
โ
Query source verified (keyword_view)
โ
Arithmetic verified ($175.50 รท 3 = $58.50)
โ ๏ธ Small sample size (n=3)
Response: "Search term 'lindy ai' drove 12 conversions at $42 CPA."
Annotations:
โ
Query source verified (search_term_view)
๐ด Arithmetic mismatch: $504 รท 12 = $42.00, but query shows $38.50
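One way to represent these annotations in code – the class name and severity levels are assumptions inferred from the examples above:

```python
from dataclasses import dataclass

MARKS = {"verified": "✅", "warning": "⚠️", "critical": "🔴"}

@dataclass
class Annotation:
    severity: str  # "verified" | "warning" | "critical"
    message: str

    def render(self) -> str:
        """Render as a Slack-ready line, e.g. '🔴 Arithmetic mismatch: ...'."""
        return f"{MARKS[self.severity]} {self.message}"
```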
Arithmetic Tool (Required for Main Agent)
The main agent cannot do math in-context. It must call the arithmetic tool:
```python
# Instead of the agent writing "The CPA is $643.55 / 11 = $74.27" (WRONG),
# it calls:
calculate(
    operation="divide",
    a=643.55,
    b=11,
    label="CPA",
)
# Returns: {"result": 58.50, "label": "CPA", "expression": "643.55 ÷ 11"}
```
This ensures arithmetic is always correct, and the validator can verify the tool was used.
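A minimal sketch of what tools/arithmetic.py could look like behind that call; the operations beyond divide are assumptions:

```python
# All math happens here, deterministically – never in the LLM's context.
OPS = {
    "add": lambda a, b: a + b,
    "subtract": lambda a, b: a - b,
    "multiply": lambda a, b: a * b,
    "divide": lambda a, b: a / b,
}
SYMBOLS = {"add": "+", "subtract": "-", "multiply": "×", "divide": "÷"}

def calculate(operation: str, a: float, b: float, label: str = "") -> dict:
    """Perform one operation and return a structured, auditable result."""
    result = round(OPS[operation](a, b), 2)
    return {
        "result": result,
        "label": label,
        "expression": f"{a} {SYMBOLS[operation]} {b}",
    }

# calculate("divide", a=643.55, b=11, label="CPA")
# → {"result": 58.5, "label": "CPA", "expression": "643.55 ÷ 11"}
```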
📈 Evals Roadmap
Once the validator is running, we build a test suite from real failures:
Eval Categories
| Category | What We Test | Source |
| --- | --- | --- |
| Query Source | keyword_view vs search_term_view selection | Historical confusions |
| Arithmetic | CPA, totals, percentages | LEARNINGS.md errors |
| Consistency | Numbers in response match query | User-reported issues |
| Edge Cases | Zero conversions, missing data | Synthetic tests |
| Warnings | Should trigger sample size warning? | Defined thresholds |
Process:
- Every time we catch an error, add it to the eval suite (one such case is sketched below)
- Run evals on validator changes
- Track accuracy over time
- Evals become the spec for validator behavior
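A sketch of one fixture and a minimal runner – the fixture shape is an assumption, and verify_arithmetic is the sketch from the validator section above:

```python
# One case for evals/cases/from_learnings.json, shown as a Python dict.
CASE = {
    "name": "cpa_division_from_learnings",
    "source": "LEARNINGS.md",
    "input": {"a": 643.55, "b": 11, "claimed": 74.27},
    "expect": "mismatch",  # the validator must flag this one
}

def run_case(case: dict) -> bool:
    """Pass iff the validator flags exactly when the case expects it to."""
    annotation = verify_arithmetic(**case["input"])
    flagged = annotation.startswith("🔴")
    return flagged == (case["expect"] == "mismatch")

assert run_case(CASE), CASE["name"]
```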
📁 Target File Structure
The proposed project structure, organized by responsibility:
```
google-ads/
├── .env                      # API credentials (create from .env.example)
├── .env.example              # Template with required vars
├── README.md                 # Project overview
│
├── .claude/                  # 🔗 Symlink to source repo's .claude/
│   ├── CLAUDE.md             # Main context file
│   ├── LEARNINGS.md          # Corrections – READ FIRST!
│   ├── GOALS.md              # Mission and targets
│   ├── CONTEXT.md            # Competitors, account info
│   ├── REPORT_STYLE.md       # Formatting guide
│   └── skills/
│       └── google-ads-query/
│           └── SKILL.md      # Full query patterns
│
├── tools/                    # Agent tools
│   ├── __init__.py
│   ├── arithmetic.py         # calculate() – all math goes here
│   ├── query.py              # GAQL query execution
│   └── sheets.py             # Google Sheets integration
│
├── validator/                # Validator sub-agent
│   ├── agent.py              # Sub-agent orchestration
│   ├── prompt.md             # Skeptical reviewer prompt
│   └── tools/
│       ├── __init__.py
│       ├── verify_arithmetic.py    # Recalculate independently
│       ├── verify_query_source.py  # keyword_view vs search_term_view
│       ├── verify_consistency.py   # Numbers match raw data?
│       └── check_sample_size.py    # Warn on small n
│
├── evals/                    # Test suite
│   ├── run_evals.py          # Eval runner
│   ├── test_arithmetic.py    # Arithmetic error cases
│   ├── test_query_source.py  # Data source confusion cases
│   ├── test_consistency.py   # Response/data mismatch cases
│   └── cases/                # JSON test fixtures
│       ├── arithmetic_errors.json
│       ├── query_source.json
│       └── from_learnings.json     # Converted from LEARNINGS.md
│
├── scripts/                  # 🔗 Existing scripts (symlinked)
│   └── client.py             # Google Ads API client
│
└── service-account.json      # 🔗 Google service account (symlinked)
```
Key Changes from Current Structure
| Current | Proposed | Why |
| --- | --- | --- |
| .context/ (simplified copies) | .claude/ (symlink to source) | Single source of truth, includes LEARNINGS.md |
| Math in LLM context | tools/arithmetic.py | Deterministic, testable, can't hallucinate |
| No validation | validator/ sub-agent | Catches errors before they reach the user |
| No tests | evals/ suite | Learn from mistakes, prevent regressions |
| Separate Slack bot process | OpenClaw handles Slack directly | No subprocess, no CLI hacks – native integration |
✅ Action Plan
Phase 1: Foundation (This Week)
- Set up Google Ads API credentials (developer token, OAuth flow)
- Create .env file with all required credentials
- Fix context split-brain: symlink .claude/ folder
- Install gspread, share target sheets with service account
Phase 2: Core Tools (Next Week)
- Build calculate() arithmetic tool
- Build validator sub-agent with initial prompt
- Implement verify_arithmetic()
- Implement verify_query_source()
- Implement verify_consistency()
Phase 3: Integration (Week 3)
- Wire validator sub-agent into response flow
- Add annotation formatting for Slack output
- Build initial eval suite (10-20 cases)
- Add check_sample_size() warnings
Phase 4: Ryan's Report (Week 4)
- Get Ryan's current report format
- Map report sections to query patterns
- Build report generator with validation
- Expand eval suite based on report needs
📊 Success Metrics
| Metric | Current | Target |
| --- | --- | --- |
| Arithmetic errors reaching user | ~10% | 0% (caught by validator) |
| Wrong data source | Unknown | 0% (validator checks) |
| Responses with annotations | 0% | 100% |
| Eval suite coverage | 0 cases | 50+ cases |
| Can replicate Ryan's report | No | Yes |
🎯 Design Principles
Flexibility with Guardrails
The main agent stays flexible – it can handle novel questions. The validator adds a safety net without constraining what questions we can answer.
Annotate, Don't Block
The validator adds transparency, not friction. Users see confidence levels. Only critical errors (like arithmetic mismatches) get flagged prominently.
Learn from Mistakes
Every error becomes an eval case. The system gets smarter over time. Evals are the spec – if we can't test it, we can't guarantee it.
Separation of Concerns
Main agent: understanding + querying + narrating. Arithmetic tool: all calculations. Validator: all verification. Each piece can improve independently.