🦞 Gary Redesign

Flexible + Validated Architecture

v2 — Updated

January 30, 2026 • Prepared by Max

📋 Executive Summary

Gary (the Google Ads Slack bot) was hallucinating numbers. After analysis, we identified four root causes. This v2 proposal keeps the agent flexible while adding a validation layer — avoiding the overcorrection of rigid, fixed tools.

โŒ Current State

  • LLM writes GAQL queries (sometimes against the wrong data source)
  • LLM does arithmetic in-context (error-prone)
  • No verification before output
  • Context scattered across 6+ files
  • Missing API credentials

✅ Proposed State

  • LLM keeps flexibility to write queries
  • Arithmetic via dedicated tool (deterministic, can't miscalculate)
  • Validator sub-agent checks everything
  • Annotations show confidence level
  • Evals catch regressions over time

v2 Change: Why Not Rigid Tools?

The v1 proposal suggested 10 fixed query templates. That's too constraining — users ask novel questions, and we'd constantly be adding new tools. Instead, the main agent keeps the flexibility to write its own queries, and a validator sub-agent (described below) checks every response before it ships.

Simplified: No Separate Bot Needed

Gary was a separate Slack bot process that shelled out to Claude CLI. That's obsolete now — OpenClaw already has native Slack access.

No subprocess hacks. No "recover_response_from_session" workarounds. Just native integration.

๐Ÿ” Root Cause Analysis

Critical: No Verification Loop

Gary's architecture: Slack → Claude CLI → Response. If Claude hallucinates, nothing catches it.

Evidence from LEARNINGS.md:

User caught an error where Gary reported $643.55 ÷ 11 = $74.27 (the correct value is $58.50)
→ Gary queried correctly but botched the arithmetic

High: Context Sprawl

The same rules appear in 6+ locations with subtle differences. Working from the projects/ folder means missing LEARNINGS.md — the file with the corrections!

High: keyword_view vs search_term_view Confusion

Called out six times in docs as "the #1 source of wrong numbers." The validator will explicitly check for this.
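
To make the distinction concrete, these are roughly the two query shapes involved (illustrative GAQL; the exact field lists would match the report in question).

Keyword performance (must use keyword_view):

SELECT
  ad_group_criterion.keyword.text,
  metrics.clicks,
  metrics.conversions,
  metrics.cost_micros
FROM keyword_view
WHERE segments.date DURING LAST_7_DAYS

What users actually typed (must use search_term_view):

SELECT
  search_term_view.search_term,
  metrics.clicks,
  metrics.conversions,
  metrics.cost_micros
FROM search_term_view
WHERE segments.date DURING LAST_7_DAYS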

Critical: Missing API Credentials

Service | Status | Missing
Google Ads | ❌ Incomplete | Developer token, OAuth, refresh token
Google Sheets | ⚠️ Partial | gspread not installed, sheets not shared
Google Docs | ❌ None | Everything
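
For reference, .env.example would need to cover at least the following. The Google Ads variable names follow the google-ads Python client's environment-variable conventions; the Sheets line is an assumption about how we'd wire the service account:

# Google Ads API
GOOGLE_ADS_DEVELOPER_TOKEN=
GOOGLE_ADS_CLIENT_ID=
GOOGLE_ADS_CLIENT_SECRET=
GOOGLE_ADS_REFRESH_TOKEN=
GOOGLE_ADS_LOGIN_CUSTOMER_ID=

# Google Sheets / Docs (service account; path is illustrative)
GOOGLE_APPLICATION_CREDENTIALS=./service-account.json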

๐Ÿ—๏ธ Proposed Architecture: Validator Sub-Agent

The key insight: Keep the main agent flexible, but run every response through a skeptical validator before sending.

┌──────────────────────────────────────────────────────────────────┐
│ USER QUESTION                                                     │
│ "How is the ai assistant keyword doing?"                          │
└──────────────────────────────────────────────────────────────────┘
                                 ↓
┌──────────────────────────────────────────────────────────────────┐
│ OPENCLAW (flexible, native Slack)                                 │
│                                                                   │
│ • Understands the question                                        │
│ • Writes appropriate GAQL query                                   │
│ • Executes query, gets results                                    │
│ • Calls arithmetic tool for any calculations                      │
│ • Drafts response                                                 │
│                                                                   │
│ Output: { response, raw_query, raw_results, calculations }        │
└──────────────────────────────────────────────────────────────────┘
                                 ↓
┌──────────────────────────────────────────────────────────────────┐
│ VALIDATOR SUB-AGENT                                               │
│                                                                   │
│ Prompt: "You are a skeptical reviewer. Your job is to catch       │
│ errors before they reach the user. Check everything."             │
│                                                                   │
│ Tools:                                                            │
│ • verify_query_source(intent, gaql) → right data source?          │
│ • verify_arithmetic(expr, result) → recalculate independently     │
│ • verify_consistency(data, response) → numbers match?             │
│ • check_sample_size(n) → warn if too small                        │
│                                                                   │
│ Output: Annotations (does NOT block, only annotates)              │
└──────────────────────────────────────────────────────────────────┘
                                 ↓
┌──────────────────────────────────────────────────────────────────┐
│ FINAL OUTPUT                                                      │
│                                                                   │
│ Response + Annotations → User                                     │
│                                                                   │
│ Example:                                                          │
│ "The 'ai assistant' keyword has 45 clicks, 3 QSUs, $127 CPA"      │
│ ✅ Verified  ⚠️ Small sample (n=3)                                 │
└──────────────────────────────────────────────────────────────────┘
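
A possible shape for the handoff payload between the two agents, sketched in Python to match the file layout below (field names mirror the diagram; nothing here is final):

from dataclasses import dataclass, field

@dataclass
class Calculation:
    """One arithmetic-tool call, recorded so the validator can re-check it."""
    expression: str  # e.g. "643.55 ÷ 11"
    result: float    # e.g. 58.50
    label: str       # e.g. "CPA"

@dataclass
class DraftResponse:
    """What the main agent hands to the validator sub-agent."""
    response: str             # drafted Slack message
    raw_query: str            # the GAQL that was executed
    raw_results: list[dict]   # rows returned by the query
    calculations: list[Calculation] = field(default_factory=list)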

Why a Sub-Agent?

A separate sub-agent reviews the draft with fresh eyes: it is not anchored to the main agent's reasoning, its prompt is deliberately adversarial, and its tools recompute everything independently instead of trusting the draft. It can also improve (better prompt, more checks) without touching the main agent.

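A sketch of how validator/agent.py might orchestrate those checks, assuming the DraftResponse payload sketched earlier and the four tools in the table below (all names are provisional):

# validator/agent.py: run every check, collect annotations, never block
from validator.tools import (
    verify_query_source,
    verify_arithmetic,
    verify_consistency,
    check_sample_size,
)

def validate(question: str, draft) -> list[str]:
    """Return annotation strings; a check that finds nothing returns None."""
    checks = [verify_query_source(question, draft.raw_query)]
    checks += [verify_arithmetic(c.expression, c.result) for c in draft.calculations]
    checks.append(verify_consistency(draft.raw_results, draft.response))
    checks.append(check_sample_size(draft.raw_results))
    # Annotate, don't block: attach whatever came back to the response
    return [annotation for annotation in checks if annotation is not None]
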
Validator Tools

Tool | Purpose | Example Check
verify_query_source() | Right data source for the question? | "keyword performance" → must use keyword_view
verify_arithmetic() | Recalculate independently | $643.55 ÷ 11 = ? (catch the $74.27 error)
verify_consistency() | Numbers in response match raw data? | Response says "45 clicks" — is that in the query result?
check_sample_size() | Warn if conclusions drawn from tiny data | n=3 conversions → add warning
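
As a sense of scale, verify_arithmetic can be a few lines of deterministic code (a sketch; the real tool would parse whatever expression format the arithmetic tool logs):

# validator/tools/verify_arithmetic.py: recompute independently, then compare
def verify_arithmetic(expression: str, claimed: float, tol: float = 0.005) -> str:
    """Re-evaluate an expression like '643.55 ÷ 11' and compare to the claim."""
    a, op, b = expression.split()
    ops = {"÷": lambda x, y: x / y, "×": lambda x, y: x * y,
           "+": lambda x, y: x + y, "-": lambda x, y: x - y}
    actual = ops[op](float(a), float(b))
    if abs(actual - claimed) <= tol:
        return f"✅ Arithmetic verified ({expression} = {claimed:.2f})"
    return (f"🔴 Arithmetic mismatch: {expression} = {actual:.2f}, "
            f"but response says {claimed:.2f}")

Run against the LEARNINGS.md incident, verify_arithmetic("643.55 ÷ 11", 74.27) recomputes 58.50 and returns the 🔴 annotation.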

Annotation Types

Response: "The 'ai assistant' keyword has 45 clicks, 3 QSU conversions, and a CPA of $58.50 over the last 7 days. Performance is within target."
Annotations:
โœ… Query source verified (keyword_view) โœ… Arithmetic verified ($175.50 รท 3 = $58.50) โš ๏ธ Small sample size (n=3)
Response: "Search term 'lindy ai' drove 12 conversions at $42 CPA."
Annotations:
โœ… Query source verified (search_term_view) ๐Ÿ”ด Arithmetic mismatch: $504 รท 12 = $42.00, but query shows $38.50
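
Internally the three markers amount to a tiny severity scale. A hypothetical sketch of how the annotations (plain strings, as in the validator sketches) could be ordered for display:

from enum import Enum

class Severity(Enum):
    VERIFIED = "✅"   # check passed
    WARNING = "⚠️"    # response stands, but flag it (e.g. small sample)
    CRITICAL = "🔴"   # likely wrong; flag prominently

def render(annotations: list[str]) -> str:
    """One annotation per line, critical findings first."""
    critical = Severity.CRITICAL.value
    ordered = sorted(annotations, key=lambda a: not a.startswith(critical))
    return "\n".join(ordered)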

Arithmetic Tool (Required for Main Agent)

The main agent cannot do math in-context. It must call the arithmetic tool:

// Instead of: "The CPA is $643.55 / 11 = $74.27" (WRONG)
// Agent calls:
calculate({
  operation: "divide",
  a: 643.55,
  b: 11,
  label: "CPA"
})
// Returns: { result: 58.50, label: "CPA", expression: "643.55 ÷ 11" }

This ensures arithmetic is always correct, and the validator can verify the tool was used.
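
A minimal sketch of tools/arithmetic.py under that contract, mirroring the call above (rounding and error handling are illustrative choices):

# tools/arithmetic.py: all math goes through here, never through the LLM
OPS = {
    "add":      lambda a, b: a + b,
    "subtract": lambda a, b: a - b,
    "multiply": lambda a, b: a * b,
    "divide":   lambda a, b: a / b,
}
SYMBOLS = {"add": "+", "subtract": "-", "multiply": "×", "divide": "÷"}

def calculate(operation: str, a: float, b: float, label: str = "") -> dict:
    """Deterministic arithmetic that returns an auditable expression string."""
    if operation not in OPS:
        raise ValueError(f"unknown operation: {operation}")
    result = round(OPS[operation](a, b), 2)
    return {
        "result": result,                               # e.g. 58.5
        "label": label,                                 # e.g. "CPA"
        "expression": f"{a} {SYMBOLS[operation]} {b}",  # e.g. "643.55 ÷ 11"
    }

calculate("divide", 643.55, 11, label="CPA") returns 58.5, and the validator can confirm both that the tool was called and that the expression checks out.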

📈 Evals Roadmap

Once the validator is running, we build a test suite from real failures:

Eval Categories

Category | What We Test | Source
Query Source | keyword_view vs search_term_view selection | Historical confusions
Arithmetic | CPA, totals, percentages | LEARNINGS.md errors
Consistency | Numbers in response match query | User-reported issues
Edge Cases | Zero conversions, missing data | Synthetic tests
Warnings | Should trigger sample size warning? | Defined thresholds

Process:

  1. Every time we catch an error, add it to the eval suite
  2. Run evals on validator changes
  3. Track accuracy over time
  4. Evals become the spec for validator behavior
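
For instance, the $74.27 incident could become a fixture in cases/from_learnings.json and be replayed by run_evals.py. A sketch, reusing the verify_arithmetic signature from above (the fixture shape is hypothetical):

# evals/run_evals.py: replay known failures against the validator tools
import json
from validator.tools.verify_arithmetic import verify_arithmetic

# Example fixture in cases/from_learnings.json (hypothetical shape):
# [{ "expression": "643.55 ÷ 11", "claimed": 74.27, "expect": "critical" }]

def run_case(case: dict) -> bool:
    annotation = verify_arithmetic(case["expression"], case["claimed"])
    caught = annotation.startswith("🔴")
    return caught == (case["expect"] == "critical")

def main(path: str = "evals/cases/from_learnings.json") -> None:
    with open(path) as f:
        cases = json.load(f)
    passed = sum(run_case(c) for c in cases)
    print(f"{passed}/{len(cases)} eval cases passed")

if __name__ == "__main__":
    main()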

๐Ÿ“ Target File Structure

The proposed project structure, organized by responsibility:

google-ads/
├── .env                        # API credentials (create from .env.example)
├── .env.example                # Template with required vars
├── README.md                   # Project overview
│
├── .claude/                    # 🔗 Symlink to source repo's .claude/
│   ├── CLAUDE.md               # Main context file
│   ├── LEARNINGS.md            # Corrections — READ FIRST!
│   ├── GOALS.md                # Mission and targets
│   ├── CONTEXT.md              # Competitors, account info
│   ├── REPORT_STYLE.md         # Formatting guide
│   └── skills/
│       └── google-ads-query/
│           └── SKILL.md        # Full query patterns
│
├── tools/                      # Agent tools
│   ├── __init__.py
│   ├── arithmetic.py           # calculate() — all math goes here
│   ├── query.py                # GAQL query execution
│   └── sheets.py               # Google Sheets integration
│
├── validator/                  # Validator sub-agent
│   ├── agent.py                # Sub-agent orchestration
│   ├── prompt.md               # Skeptical reviewer prompt
│   └── tools/
│       ├── __init__.py
│       ├── verify_arithmetic.py    # Recalculate independently
│       ├── verify_query_source.py  # keyword_view vs search_term_view
│       ├── verify_consistency.py   # Numbers match raw data?
│       └── check_sample_size.py    # Warn on small n
│
├── evals/                      # Test suite
│   ├── run_evals.py            # Eval runner
│   ├── test_arithmetic.py      # Arithmetic error cases
│   ├── test_query_source.py    # Data source confusion cases
│   ├── test_consistency.py     # Response/data mismatch cases
│   └── cases/                  # JSON test fixtures
│       ├── arithmetic_errors.json
│       ├── query_source.json
│       └── from_learnings.json # Converted from LEARNINGS.md
│
├── scripts/                    # 🔗 Existing scripts (symlinked)
│   └── client.py               # Google Ads API client
│
└── service-account.json        # 🔗 Google service account (symlinked)

Key Changes from Current Structure

Current | Proposed | Why
.context/ (simplified copies) | .claude/ (symlink to source) | Single source of truth, includes LEARNINGS.md
Math in LLM context | tools/arithmetic.py | Deterministic, testable, can't hallucinate
No validation | validator/ sub-agent | Catches errors before they reach the user
No tests | evals/ suite | Learn from mistakes, prevent regressions
Separate Slack bot process | OpenClaw handles directly | No subprocess, no CLI hacks — native integration

✅ Action Plan

Phase 1: Foundation (This Week)

Stand up the target file structure and fill in the missing credentials: Google Ads developer token, OAuth client, refresh token; install gspread and share the sheets; wire the service account.

Phase 2: Core Tools (Next Week)

Build the agent tools: tools/arithmetic.py (all math), tools/query.py (GAQL execution), tools/sheets.py.

Phase 3: Integration (Week 3)

Add the validator sub-agent and annotations, running natively in OpenClaw's Slack flow.

Phase 4: Ryan's Report (Week 4)

Replicate Ryan's report end-to-end, seeding the eval suite with any errors caught along the way.

📊 Success Metrics

Metric | Current | Target
Arithmetic errors reaching user | ~10% | 0% (caught by validator)
Wrong data source | Unknown | 0% (validator checks)
Responses with annotations | 0% | 100%
Eval suite coverage | 0 cases | 50+ cases
Can replicate Ryan's report | No | Yes

🎯 Design Principles

Flexibility with Guardrails

The main agent stays flexible — it can handle novel questions. The validator adds a safety net without constraining what questions we can answer.

Annotate, Don't Block

The validator adds transparency, not friction. Users see confidence levels. Only critical errors (like arithmetic mismatches) get flagged prominently.

Learn from Mistakes

Every error becomes an eval case. The system gets smarter over time. Evals are the spec — if we can't test it, we can't guarantee it.

Separation of Concerns

Main agent: understanding + querying + narrating. Arithmetic tool: all calculations. Validator: all verification. Each piece can improve independently.