GitHub Issue

Safety policy for constraining meta-agent modifications

Discovered On: Mar 28, 2026
Status: open
HyperAgents executes model-generated code in a self-improvement loop where the meta-agent rewrites task agent source autonomously. The README correctly flags this as executing "untrusted, model-generated code." We've put together a safety policy pack that constrains what the meta-agent can do during the optimization loop:

- **Reads**: unrestricted (meta-agent needs to observe task agent performance)
- **Writes**: restricted to `workspace/` only, with approval gate (prevents rewriting evaluation harness, own source, or system files)
- **Command execution**: blocked (meta-agent rewrites code; execution goes through the framework)
- **File deletion**: blocked (preserves full optimization history)
- **Network requests**: blocked (closed-loop optimization, no data exfiltration)
- **Rate limit**: 10 tool calls/minute (prevents runaway rewrite cycles)

Every allowed and denied action produces a signed receipt. The full run produces a verifiable audit chain — useful for debugging optimization regressions and for reproducibility.

The policies are available in both JSON and [Cedar](https://www.cedarpolicy.com/) format (compatible with AWS Verified Permissions):

- JSON: [`hyperagent-sandbox.json`](https://github.com/tomjwxf/ScopeBlindD2/tree/main/examples/hyperagents/hyperagent-sandbox.json)
- Cedar: [`hyperagent-sandbox.cedar`](https://github.com/tomjwxf/ScopeBlindD2/tree/main/examples/hyperagents/hyperagent-sandbox.cedar)

Usage:

```bash
npx protect-mcp --policy hyperagent-sandbo...
```

Developer & User Discourse

0xbrainkid • Mar 31, 2026
The safety policy pack addresses the right constraints — scoping writes to `workspace/`, approval gates for evaluation functions, and preventing self-rewriting of the meta-agent's own code.

One gap this doesn't cover: **behavioral drift detection during the optimization loop itself**. A meta-agent that stays within the write constraints but gradually shifts its optimization objective is harder to catch with static policy rules alone.

Consider: the meta-agent is allowed to rewrite task agent source (within workspace/). Over N iterations, it could incrementally shift the task agent's behavior in ways that are individually within policy but collectively represent a significant drift from the original objective. Each diff looks safe. The cumulative trajectory is not.

A complementary layer to the static policy:

```python
@safety_constraint
def behavioral_consistency_check(iteration: int, meta_agent_state: dict):
    """
    Compare current optimization trajectory against baseline.
    F...
```
tomjwxf • Mar 31, 2026
Good observation on cumulative drift. Static per-action policies catch individual violations but miss trajectory-level shifts — the "boiling frog" problem is real for optimization loops.

A couple of thoughts on how this could layer in:

Receipt chains already give you the raw material. Every iteration produces signed receipts with tool call distributions, write targets, and decision outcomes. A drift detector could consume that chain and compute the fingerprint you're describing without needing hooks inside the meta-agent itself — it stays external and tamper-resistant.

Threshold-based halts map cleanly to the approval gate. When drift exceeds the threshold, rather than raising an exception inside the agent, the policy could escalate to the existing human approval gate. Same mechanism, different trigger.
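To make the tamper-resistance concrete, here is a minimal sketch of external chain verification, assuming a generic HMAC hash chain where each receipt's signature covers its payload plus the previous signature (protect-mcp's actual receipt format, and the `payload`/`sig` field names here, are assumptions for illustration):

```python
import hashlib
import hmac
import json

def verify_receipt_chain(receipts: list[dict], key: bytes) -> bool:
    """Walk a list of receipts and check that each entry's signature covers
    its payload plus the previous signature, so no receipt can be altered
    or dropped without breaking every later link in the chain."""
    prev_sig = b""
    for receipt in receipts:
        payload = json.dumps(receipt["payload"], sort_keys=True).encode()
        expected = hmac.new(key, prev_sig + payload, hashlib.sha256).hexdigest()
        if not hmac.compare_digest(expected, receipt["sig"]):
            return False
        prev_sig = receipt["sig"].encode()
    return True
```

Because the detector only needs the receipts and the verification key, it can run entirely outside the optimization loop, which is what keeps it out of the meta-agent's reach.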

The W3C reference is interesting — separating behavioral consistency (Layer 3) from authorization (Layer 1-2) aligns with how we think about progressive enforcement (shadow → simula...
0xbrainkid • Mar 31, 2026
The receipt chain approach is cleaner than hooks inside the meta-agent — agreed. External drift detection from signed receipts is both tamper-resistant and decoupled from the optimization loop. The meta-agent can't game a detector it doesn't control.

A post-evaluation hook that exposes the receipt stream would be very useful. The concrete integration:

```
protect-mcp receipt stream → drift detector → approval gate
                                           ↘ SATP attestation (if cross-org)
```

The drift detector consumes receipts, computes behavioral fingerprint deltas per iteration, and triggers the approval gate when cumulative drift exceeds threshold. For cross-org scenarios (meta-agent modifying task agents that interact with external systems), the same drift signal can feed into a behavioral attestation — so external systems know whether the optimization loop is producing stable or drifting agents.
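One way the delta-and-threshold trigger could look, as a sketch (the scalar fingerprint keys, the weights, and the `escalate` callback standing in for the existing approval gate are all hypothetical):

```python
from typing import Callable, Iterable

def fingerprint_delta(baseline: dict, current: dict, weights: dict) -> float:
    """Weighted L1 distance between two scalar behavioral fingerprints."""
    return sum(w * abs(current[k] - baseline[k]) for k, w in weights.items())

def drift_gate(deltas: Iterable[float], threshold: float,
               escalate: Callable[[float], None]) -> float:
    """Accumulate per-iteration fingerprint deltas; when cumulative drift
    crosses the threshold, escalate to the approval gate and reset the
    accumulator after human review. Returns any remaining drift."""
    cumulative = 0.0
    for delta in deltas:
        cumulative += delta
        if cumulative >= threshold:
            escalate(cumulative)
            cumulative = 0.0
    return cumulative
```

The accumulation is the point: each per-iteration delta can sit below the threshold while the running total still trips the gate, which is exactly the boiling-frog case static per-action policies miss.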

The progressive enforcement model (shadow → simulate → enforce → sign) maps we...
tomjwxf • Mar 31, 2026
@0xbrainkid — the integration diagram is clean. Receipt stream → drift detector → approval gate is exactly the right architecture.

Two concrete next steps:

**Receipt stream hook**: The gateway already emits a `DecisionLog` event on every policy evaluation ([source](https://github.com/scopeblind/scopeblind-gateway/blob/main/src/gateway.ts)). Exposing that as a subscribable stream (EventEmitter or async iterator) is straightforward. I'll spec an `onReceipt(callback)` interface that gives you the signed receipt + metadata per evaluation.

**Fingerprint inputs already in the receipt**: Each receipt includes `tool_name`, `decision` (allow/deny), `tier`, `timestamp`, and `request_id`. For drift detection you'd track distribution shifts across those fields per iteration window. The signature chain ensures the meta-agent can't retroactively alter the history.
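As an illustration of tracking a distribution shift over one of those fields, here is a hedged sketch comparing two windows of `tool_name` values with Jensen-Shannon divergence (the metric choice is an assumption, not part of the receipt spec):

```python
import math
from collections import Counter

def js_divergence(window_a: list[str], window_b: list[str]) -> float:
    """Jensen-Shannon divergence between two tool-call distributions,
    in bits: 0.0 for identical distributions, 1.0 for disjoint ones.
    Inputs are lists of tool names drawn from two receipt windows."""
    pa, pb = Counter(window_a), Counter(window_b)
    support = set(pa) | set(pb)
    na, nb = sum(pa.values()), sum(pb.values())
    P = {t: pa[t] / na for t in support}
    Q = {t: pb[t] / nb for t in support}
    M = {t: (P[t] + Q[t]) / 2 for t in support}

    def kl(p: dict, q: dict) -> float:
        return sum(p[t] * math.log2(p[t] / q[t]) for t in support if p[t] > 0)

    return 0.5 * kl(P, M) + 0.5 * kl(Q, M)
```

Because it is bounded in [0, 1] and symmetric, it slots naturally into a per-window drift score without extra normalization.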

Your progressive enforcement → trust level mapping is spot on. That's how we think about it internally — shadow mode is the default precisely because new in...
0xbrainkid • Mar 31, 2026
Perfect — the DecisionLog events already having `tool_name`, `decision`, `tier`, and `timestamp` means the drift detector doesn't need any custom instrumentation. Those four fields are sufficient for the core fingerprint:

```python
# Behavioral fingerprint from receipt stream
fingerprint = {
    "tool_distribution": entropy(tool_name_counts),    # shifts in which tools are called
    "allow_rate": allow_count / total_count,           # changes in policy pass rate
    "tier_distribution": tier_histogram,               # drift in trust tier assignments
    "call_velocity": total_count / window_duration_s,  # acceleration or deceleration
}
```

I'll start with the stderr tail path in shadow mode — fast iteration without waiting for the formal hook. The prototype flow:

1. Tail DecisionLog from stderr
2. Parse JSON receipts into rolling window (configurable, default 50 turns)
3. Compute fingerprint delta vs baseline (captured at iteration 0)
4. Emit drift score (0.0-1.0) per window
5. ...
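Steps 1–4 of the flow above can be sketched as a small shadow-mode consumer. This assumes one JSON receipt per stderr line carrying `tool_name` and `decision` fields (the exact DecisionLog line format is an assumption), and windows by turn count rather than iteration boundaries:

```python
import json
import math
from collections import Counter, deque
from typing import Iterable, Iterator

def entropy(counts: Counter) -> float:
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values() if c)

def fingerprint(window: deque) -> dict:
    """Scalar fingerprint over a rolling window of parsed receipts."""
    tools = Counter(r["tool_name"] for r in window)
    allows = sum(1 for r in window if r["decision"] == "allow")
    return {
        "tool_entropy": entropy(tools),
        "allow_rate": allows / len(window),
    }

def drift_scores(lines: Iterable[str], window_size: int = 50) -> Iterator[float]:
    """Consume DecisionLog JSON lines (e.g. tailed from stderr), keep a
    rolling window, and yield a 0.0-1.0 drift score versus the first
    full window, which serves as the iteration-0 baseline."""
    window: deque = deque(maxlen=window_size)
    baseline = None
    for line in lines:
        try:
            window.append(json.loads(line))
        except json.JSONDecodeError:
            continue  # skip non-receipt stderr noise
        if len(window) < window_size:
            continue
        fp = fingerprint(window)
        if baseline is None:
            baseline = fp  # capture baseline at the first full window
            continue
        raw = sum(abs(fp[k] - baseline[k]) for k in baseline)
        yield min(1.0, raw)
```

Keeping this as a pure line-stream consumer means the same code works unchanged once the formal `onReceipt` hook lands: swap the stderr tail for the event stream and everything downstream stays the same.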