Research Instrument · Human-AI Collaboration

Agentic Chicken with
Coaching-in-the-Loop

A research instrument that measures two questions the current literature cannot answer at scale: how humans calibrate autonomy when coaching AI agents, and whether sustained AI assistance preserves or erodes the engagement that makes human coaching worth keeping in the loop. The first question is studied at lab scale. The second has no empirical answer at any scale.

Researcher: Arnold Ray Kagaoan Alagar
Institution: Tressellate · tressellate.dev
Contact: arnold.alagar@gmail.com
Status: Research instrument · v2 · April 2026

Two questions, one instrument

Question one: how do humans calibrate autonomy when coaching AI agents?

The applied literature on human-automation interaction is substantial — Cummings on supervisory control, Sarter and Woods on automation surprises, Hancock on trust calibration, Shneiderman on human-centered AI, Handa and others on reliance in language-model interaction. What this literature does not contain: large-N, sustained, ecologically valid measurement of autonomy calibration in human-AI agent coaching under controlled variation of the autonomy mix itself. Existing work is short-task lab studies, single-deployment observational data, or theoretical. ACCL holds the autonomy parameter α as a controlled experimental variable while the human-agent coaching relationship runs over hundreds of decision ticks per session, across weeks of voluntary engagement, in a cohort large enough to detect medium effect sizes.

Question two: does sustained AI assistance erode the engagement that makes coaching worth doing?

This question has no empirical answer at any scale. The closest relevant literatures point at it without addressing it directly: self-determination theory (Deci & Ryan) on competence as a basic psychological need that requires calibrated challenge; flow theory (Csikszentmihalyi) on engagement requiring challenge calibrated to skill; Calhoun's behavioral-sink work as a dramatic but methodologically limited touchstone for what happens when challenge is removed; deaths-of-despair research as a partial purposelessness-in-post-industrial-communities story; UBI experiments as limited evidence on engagement when economic challenge is reduced. None of these measure what ACCL is positioned to measure: whether α calibration in human-AI coaching, sustained over time, preserves or erodes the engagement and sense-of-contribution substrate that makes the coaching worth doing.

This is the load-bearing question for every AI deployment thesis that depends on human oversight remaining meaningful. If humans disengage past a certain α threshold — or disengage gradually under sustained high-α conditions even when they do not notice it themselves — then human-in-the-loop as a safety mechanism has a structural ceiling that no model improvement can lift. ACCL is the first instrument designed to measure where that ceiling is.

Why this instrument can measure both

Engagement at scale produces data quality lab conditions and paid-compliance cohorts cannot deliver. Foldit produced peer-reviewed protein structure work that algorithmic methods missed. EyeWire mapped neurons automated tools could not. Galaxy Zoo generated astronomical datasets professional researchers could not produce. The pattern is consistent: gamified participation with intrinsic stakes produces data density and quality that surveys, lab tasks, and paid cohorts structurally cannot reach. ACCL operationalizes this pattern for HITL measurement specifically — and because the engagement is itself a measurement target, the instrument's own engagement behavior becomes the data that answers question two.

A game that is also a measurement device

ACCL is structured as a competitive game in which human players coach AI agents through timed rounds. The game format is modeled on sabong — a Filipino cultural tradition in which a human invests in and coaches a semi-autonomous competitor toward competitive outcomes. In ACCL, the competitor is an AI agent, the coaching is digital, and nothing physical competes. The cultural structure is preserved because it produces engagement; the underlying practice is not.

The core variable is human-in-the-loop intensity. At each decision tick, the AI agent selects an action based on a weighted combination of its own policy and real-time input from its human coach. The autonomy parameter α sets the mix:

$$a^* = \arg\max_a \left[ (1 - \alpha) \cdot Q_{\text{autonomous}}(s, a) + \alpha \cdot Q_{\text{coached}}(s, a, c) + \varepsilon \right]$$

where c is the coaching input vector provided by the human at each decision tick, α ∈ [0,1] is the autonomy parameter controlling the autonomy/oversight mix, and ε is an exploration term whose distribution is held constant across all conditions.
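The decision rule can be sketched directly. A minimal illustration, assuming the two Q-terms arrive as plain NumPy arrays over the action set and ε is Gaussian exploration noise; the function name, array shapes, and noise scale are hypothetical, not the instrument's actual implementation:

```python
import numpy as np

def select_action(q_autonomous: np.ndarray,
                  q_coached: np.ndarray,
                  alpha: float,
                  rng: np.random.Generator,
                  eps_scale: float = 0.01) -> int:
    """Pick a* = argmax over the alpha-weighted mix of the agent's own
    policy values and the coached values, plus an exploration term whose
    distribution is the same in every condition."""
    eps = rng.normal(0.0, eps_scale, size=q_autonomous.shape)
    scores = (1.0 - alpha) * q_autonomous + alpha * q_coached + eps
    return int(np.argmax(scores))

# The three league weights from the study design
LEAGUES = {"A": 0.20, "B": 0.40, "C": 0.55}
```

In a replayable system the exploration draws would themselves be recorded, or the generator seeded and attested, so that re-running a tick reproduces the same action.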

ACCL runs three leagues at three α values. The same human coaches the same AI across all three leagues — with counterbalancing, order randomization, and washout protocols to separate α effects from learning transfer and fatigue. This is the cleanest empirical handle on the human-AI autonomy calibration problem currently constructible outside a lab.

League A — Low Coaching Weight (α = 0.20)

The agent relies primarily on its own policy. Human input carries 20% weight. Measures baseline agent behavior and human response to low-oversight conditions.

League B — Balanced (α = 0.40)

A near-midpoint split, asymmetrically placed below 0.5. Serves as the primary calibration band for detecting coaching efficiency and mental model formation.

League C — High Coaching Weight (α = 0.55)

Human coaching drives more than half the action selection. Asymmetric spacing above midpoint reveals non-linear behavior in high-oversight conditions.

Why asymmetric spacing? The leagues are not evenly distributed around 0.5. The gap between A and B (0.20) is larger than between B and C (0.15). This is deliberate: the mid-range is where calibration difficulty is highest and where coaching style differences are most likely to emerge. Finer resolution there increases sensitivity to the effect of interest.

Five measurements the current literature lacks

Each measurement is designed to produce data that survey research, lab tasks, and paid-compliance cohorts structurally cannot generate. The game format is not ornamental — it is what makes these measurements possible.

What ACCL predicts

Each of the five measurements has an empirical hypothesis attached. The instrument is designed so any of the hypotheses can be falsified by the data; this section names them so the falsifiability surface is explicit.

On autonomy calibration

ACCL predicts that performance under controlled α variation produces a non-monotonic curve: performance is poor at α near 0 (agent under-coached, baseline noisy), improves through a calibration band roughly between 0.3 and 0.5, and degrades at high α (human input introduces noise the agent's policy would have avoided). The optimal α is hypothesized to shift downward with coaching experience, as humans learn when their input adds value and when it does not.

On mental model convergence

ACCL predicts that mental model accuracy improves with experience, but the rate of improvement depends on α: high-α conditions accelerate model formation (the human is forced to engage with agent behavior more closely) and low-α conditions slow it (the human has fewer opportunities to test their predictions against agent decisions).

On coaching style emergence

ACCL predicts that distinct coaching styles cluster into a small number of stable strategies (likely three to five) and that stylistic differences are stable across α conditions but produce different performance profiles. Specifically, "interventionist" styles are predicted to underperform at low α and outperform at high α; "delegating" styles are predicted to do the opposite.

On marketplace revealed preferences

ACCL predicts that the same agent commands measurably different prices across the three α leagues, with the highest valuations in the calibration band where coaching efficiency is highest. Marketplace pricing under low-α conditions is predicted to track agent capability directly; pricing under high-α conditions is predicted to track perceived coachability.

On engagement and purpose stability

ACCL predicts that engagement remains stable under α variation in the short term, but degrades over sustained sessions specifically at the high end of the α range — not because high α is intrinsically disengaging but because sustained heavy coaching produces fatigue without proportional sense-of-contribution gain. The prediction that matters most for the broader thesis: at moderate α (the calibration band), engagement is hypothesized to remain stable or improve over time, supporting the proposition that calibrated human-AI coaching is sustainable work.

If the data falsifies these predictions — particularly the engagement-stability hypothesis — the broader thesis that emerges in the program's longer-arc framing weakens substantially. ACCL is designed so the data answers the question regardless of which way it falls.

Why the Philippines, why BPO, why now

The macroeconomic stakes are sovereign-level, not sectoral.

The Philippine BPO sector and overseas worker remittances together account for roughly 18% of GDP and underwrite the consumption economy that supports much of the rest. The country hosts the world's largest concentration of the workforce most directly exposed to AI substitution in knowledge work. AI's effect on BPO employment is not a sectoral question for the Philippines — it is a question of macroeconomic continuity. Philippine fiscal policy through 2035 depends materially on whether the BPO transition produces managed adaptation or rapid displacement. No comparable research site exists for empirical work on AI's labor-economic effects: the exposure is concentrated, the workforce is measurable, the institutional research infrastructure is in place, and the policy stakes are immediate rather than abstract.

The Filipino BPO population is the population the second question is about.

Question two — whether sustained AI assistance erodes the engagement that makes human coaching worth doing — is not an abstract question for this cohort. It is the operational question their next decade of work will answer either way. 1.5 million Filipino BPO workers are currently transitioning from executing tasks to managing agents that execute tasks. Whether that transition produces meaningful work or managed disengagement is the load-bearing question for Philippine economic policy through 2035. ACCL measures this question in the population it is about, using a format that population already understands, before the answer is locked in by deployment defaults nobody has measured.

1.5M · Filipino BPO workers directly exposed to AI-driven task automation
120 · Target cohort across three leagues (power analysis for Cohen's d ≈ 0.5)
~0 · Learning threshold for Filipino participants: sabong is the native game format

Sabong is the culturally native format for this population

Any gamified research instrument has to choose a game. Choosing one with existing cultural continuity in the target population lowers the engagement threshold to near zero. Filipino participants do not need to learn what the game is — they only need to learn the platform. This is not a claim about cognitive transfer from the cultural format to LLM coaching. It is a simpler claim about engagement quality.

The Philippines has a specific economic stake in the skill the game measures

BPO work is transitioning from executing tasks to managing agents that execute tasks. The skill that transition requires is exactly what α measures: how much to let the agent decide, how much to intervene, when to override. A research instrument that studies this skill in a Filipino population, using a format that population already understands, is measuring a real future — not an abstract one.

Institutional partners identified: De La Salle University and Ateneo de Manila University School of Social Sciences are the prospective IRB venues. University of the Philippines Diliman is identified for post-fellowship continuation. IRB review will be secured prior to any cohort enrollment.

A layered substrate that attests its own observations

ACCL is built on a layered framework modeled on the OSI networking reference design. Each layer provides defined services to the layer above through clean interfaces. The architecture exists for a specific reason: empirical research on AI systems currently relies on researcher reputation and journal review for trust. As the deployments studied become higher-stakes, that substrate is insufficient.

Cryptographic attestation at source and deterministic replay shift the verification burden from reputation to mechanism. ACCL's research outputs are verifiable by construction, not by attribution.

L5 · Application: The game, the marketplace, and the α parameter. The research interface that participants interact with directly. All observable behavior at this layer is recorded and attested by L2–L4.
L4 · Verification & Query: Research query interface. Allows deterministic reconstruction of any session from attested primitives. Enables independent replication of findings from the ledger record without access to the application layer.
L3 · Distributed Ledger Anchoring: Hedera Hashgraph anchoring of attested decision primitives. Provides deterministic replay: any session can be reconstructed exactly from the ledger record. Eliminates data provenance disputes.
L2 · Cryptographic Attestation: Each decision primitive — human input vector c, agent state s, action a, and timestamp — is cryptographically signed at source before propagation. No post-hoc data manipulation is possible.
L1 · Physical Measurement: Decision ticks at 30 Hz. Human input and agent state sampled at consistent intervals. The tick rate is held constant across all three leagues to eliminate timing artifacts from α comparisons.
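One way to realize the L2/L3 contract is a signed hash chain over decision primitives: each tick's (c, s, a, timestamp) tuple is canonically serialized, chained to the previous digest, and signed at source, so any later edit breaks verification. The sketch below is a simplification under stated assumptions — it uses an HMAC with a shared key as a stand-in for the asymmetric signatures and Hedera anchoring a production system would use, and all names are illustrative:

```python
import hashlib
import hmac
import json

def attest_tick(key: bytes, prev_digest: str, primitive: dict) -> dict:
    """Sign one decision primitive and chain it to its predecessor.
    Canonical JSON serialization keeps replay byte-identical."""
    payload = json.dumps(primitive, sort_keys=True,
                         separators=(",", ":")).encode()
    digest = hashlib.sha256(prev_digest.encode() + payload).hexdigest()
    sig = hmac.new(key, digest.encode(), hashlib.sha256).hexdigest()
    return {"primitive": primitive, "digest": digest, "sig": sig}

def verify_chain(key: bytes, records: list, genesis: str = "") -> bool:
    """Recompute every digest and signature from the primitives alone;
    any post-hoc edit to any tick breaks the chain from that point on."""
    prev = genesis
    for rec in records:
        payload = json.dumps(rec["primitive"], sort_keys=True,
                             separators=(",", ":")).encode()
        digest = hashlib.sha256(prev.encode() + payload).hexdigest()
        expected_sig = hmac.new(key, digest.encode(),
                                hashlib.sha256).hexdigest()
        if digest != rec["digest"]:
            return False
        if not hmac.compare_digest(expected_sig, rec["sig"]):
            return False
        prev = digest
    return True
```

Because verification needs only the recorded primitives and the chain, a third party can replay a session from the ledger record without access to the application layer, which is the L4 property described above.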

This framework is one instance of a broader layered architecture for trust infrastructure that operates across physical infrastructure, financial instruments, carbon markets, and regulated services. ACCL is uniquely positioned as the only application that exercises the full framework in a single operational context at sufficient data density to expose inter-layer interaction failures — under conditions where exposing them has no physical cost.

Two US provisional patents filed April 2026 (patent agent: Steve Shattil) document the architectural substrate at a formal level.

Methodology and fellowship deliverables

Cohort: 120 participants across three α leagues, recruited from Filipino gaming and BPO-adjacent communities
Power: Designed to detect medium effect sizes (Cohen's d ≈ 0.5) on α-conditional performance differences
Controls: Counterbalancing, order randomization, and washout protocols to separate α effects from learning transfer and fatigue
IRB: Review through De La Salle University or Ateneo de Manila SSS prior to any cohort enrollment
Compensation: Structured as game entry credit — separates engagement from outcome-contingent payment and gambling framing
MVP Stack: SvelteKit / TypeScript front-end · Python data pipeline · Rust performance-critical components · Supabase/Postgres with RLS · Hedera attestation
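The power target can be sanity-checked with a normal-approximation sample-size formula for a paired (within-subject) comparison, since the same participant runs all three leagues. This is a back-of-envelope sketch, not the study's actual power analysis, which would need to account for the counterbalanced three-league structure and repeated measures:

```python
import math
from statistics import NormalDist

def paired_n(d: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate number of participants for a two-sided paired
    comparison to detect standardized effect size d."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value, ~1.96
    z_beta = NormalDist().inv_cdf(power)           # power quantile, ~0.84
    return math.ceil(((z_alpha + z_beta) / d) ** 2)

n = paired_n(0.5)  # 32 under this approximation
```

For d = 0.5 this gives roughly 32 participants, so under these assumptions a 120-person cohort leaves headroom for attrition, counterbalancing cells, and subgroup analysis.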

Six-month research timeline

Month 1 · MVP completion & IRB submission: Finalize the α-weighted decision function, three-league structure, attestation infrastructure, and simplified marketplace. Submit IRB application to Philippine institutional partner.
Month 2 · Pilot cohort & instrument validation: Run pilot cohort (20–30 participants) to validate measurement sensitivity and refine α-league counterbalancing. Preliminary analysis of performance and mental model convergence data.
Month 3 · Full cohort enrollment: Enroll remaining participants. Run all three leagues with full counterbalancing. Marketplace goes live for revealed-preference measurement.
Months 4–6 · Analysis & first paper: Full analysis across all five measurements. Draft first paper on α-parameter findings. Establish ACCL as a replicable method with documented protocols for continuation beyond the grant cycle.
Fallback scope. If MVP completion slips, the fellowship deliverable narrows to the two measurements with least marketplace dependency — performance under controlled α variation and mental model convergence. Coaching style emergence and revealed-preference measurements defer to post-fellowship continuation. The instrument's core contribution — the α-controlled three-league design — is preserved in all scenarios.

What I bring and what I'm looking for

Control systems engineering, applied to safety-critical physical environments. Core formation at Edwards Air Force Base (advanced tracking systems), Groom Lake (classified data acquisition), and NASA Ames (Final Approach Spacing Tool — a neural network deployment into live terminal-area air traffic control). The problem across all of those environments was the same: how do you bound the behavior of a system that learns from data, in an operational context where being wrong has physical consequences?

Co-founder of a two-person company that received NASA SBIR Phase 2 and DOD/DARPA SBIR Phase 2 awards in 1999, with full engineering and operational responsibility for reducing inventive mathematics to tested hardware.

The past decade has been self-funded development of the architectural thesis this work instantiates: that AI safety for physical-world systems is better approached as a control engineering problem than as a preference-learning problem, and that the layered-architecture discipline control engineering developed over seventy years transfers to AI systems when those systems are properly instrumented. Two US provisional patents filed April 2026 capture that substrate. ACCL is one application of it.

What I am looking for: institutional partners willing to support empirical research at the scale this question requires. The Anthropic Economic Futures program's combination of research grants, longitudinal data infrastructure, and policy symposia is the closest existing match for the institutional support this work needs. The exchange is bidirectional. ACCL produces empirical evidence on questions the program's published framing identifies as central — work, meaning, and what the AI-enabled economy requires of human contribution — and the data infrastructure pillar gains a longitudinal research site in the country where those questions are most economically consequential. Parallel institutional anchoring through the Philippine Department of Science and Technology's Balik Scientist Program (under the Council for Industry, Energy, and Emerging Technology Research and Development) provides the domestic counterpart structure that makes sustained operation possible.

Research alignment

ACCL extends and complements an existing body of Anthropic-affiliated empirical work on human-AI interaction and the economic effects of AI deployment.

Anthropic Economic Index. ACCL's longitudinal data on autonomy calibration, engagement stability, and revealed-preference dynamics in a Filipino BPO-adjacent cohort feeds directly into the Index's geographic-and-enterprise reporting. Philippine BPO is the highest-signal site for this measurement currently identifiable; no other comparable economy combines the workforce concentration, English fluency, institutional research infrastructure, and macroeconomic exposure necessary for the Index's longitudinal questions to land sharply.

Skill formation under AI assistance (Tamkin, Shen, and collaborators). The learning-curve findings the α parameter will produce extend the existing skill formation literature into the human-AI agent coaching context, where the cognitive task is qualitatively different from coding-with-AI assistance.

Reliance and trust calibration in language model interaction. ACCL's mental model convergence and coaching style measurements operationalize the reliance question in a sustained-engagement setting, producing data on how trust calibration evolves over hundreds of decision ticks rather than across short laboratory tasks.

Societal impacts of AI deployment in labor markets. The BPO transition is not a generic labor-market question. It is the most empirically tractable instance of the broader question the program's framing names directly: what new capabilities emerge when humans work alongside AI systems, and which human skills remain valuable as AI advances.

Continuity beyond the grant cycle

ACCL continues after the initial six-month grant cycle through institutional anchors at De La Salle University, Ateneo de Manila University, and University of the Philippines Diliman, supplemented by Balik Scientist Program designation under DOST-PCIEERD and parallel funding pursued through NSF Future of Work at the Human-Technology Frontier and other multi-year vehicles. Six months is sufficient to complete the MVP, run the initial cohort, produce a first paper on the α findings, release the instrument under permissive licensing for replication, and establish the longitudinal data infrastructure for sustained operation. The substantive research program — understanding how human-AI collaboration configurations reshape value creation in Philippine BPO and the labor markets that depend on it — operates on a multi-year horizon. The first grant cycle establishes the foundation; the work it enables is what matters.

References available on request: Steve Shattil (NASA SBIR co-executor and patent agent for the filed provisional patents). Additional references from Philippine academic host institutions (DLSU, Ateneo, UP) and DOST-PCIEERD available once host institution arrangements are formalized.