marco@ag0.xyz
Protocol Agent: What If Agents Could Use Cryptography in Everyday Life (arXiv, X.com) frames a practical “improvement stack” for agents operating in open networks: (1) infrastructure that lets agents discover each other and establish trust - either via closed, curated catalogs or through open ecosystems anchored in blockchain-based distributed registries (e.g., ERC-8004); (2) standards that specify message formats and interfaces so requests and responses are packaged consistently and work across implementations; (3) behaviors and techniques: what agents actually say and do, turn by turn, to achieve their goals; (4) learning loops: how past interactions (their own and others’) are turned into improvements via context engineering and post-training.
Throughout this note, we refer back to these as layers (1)–(4).
This paper currently focuses on layer (3). The core premise is that agents have capabilities that are structurally different from those of humans - deterministic memory, speed, and on-demand computation - so their communication techniques should eventually diverge from human conversational defaults. That shift hasn’t meaningfully happened yet, largely because today’s models are trained on multi-turn human conversations (and therefore inherit human habits and inefficiencies).
More specifically, the paper studies models that could bring advanced cryptography into everyday agent interactions. It introduces a benchmark and an arena to measure end-to-end performance across: (a) recognizing the right cryptographic primitive family, (b) negotiation/persuasion to secure counterpart buy-in, (c) correct protocol execution, (d) correct cryptographic computation/tool use, and (e) security strength. It also proposes a dataset generation pipeline to improve models on this benchmark, runs supervised fine-tuning (SFT), and shows large gains on several leading open-weight models.
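To make the scoring dimensions concrete, here is a minimal sketch of a per-episode record; the field names, 0-1 scales, and the unweighted aggregation are illustrative assumptions, not the paper’s actual rubric.

```python
from dataclasses import dataclass

@dataclass
class EpisodeScore:
    """Illustrative per-episode record for the five benchmark dimensions.

    Field names and the equal weighting below are assumptions for
    illustration; they are not the paper's actual rubric.
    """
    primitive_recognition: float   # (a) right cryptographic primitive family, 0-1
    negotiation: float             # (b) counterpart buy-in secured, 0-1
    protocol_execution: float      # (c) protocol steps executed correctly, 0-1
    crypto_computation: float      # (d) cryptographic computation / tool use correct, 0-1
    security_strength: float       # (e) strength of the resulting guarantees, 0-1

    def aggregate(self) -> float:
        # Unweighted mean as a placeholder aggregation.
        parts = (self.primitive_recognition, self.negotiation,
                 self.protocol_execution, self.crypto_computation,
                 self.security_strength)
        return sum(parts) / len(parts)
```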
Here’s what we’re considering right now:
RLAIF: Keep the current rubric and use the judge outputs to directly optimize end-to-end behavior (selection + negotiation + execution + security) under a fixed interaction budget. A practical approach could be to start offline with preference optimization (e.g., DPO/IPO) on judge-ranked candidates, since that is cheap and stable for fast iteration. Then, once the pipeline is solid, move on-policy to PPO-style RLAIF with KL control to the SFT reference, so that we optimize the behavior distribution we actually get in live multi-turn play (especially tool use and long-horizon protocol execution).
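As a sketch of the offline stage, one could turn judge rankings over sampled candidate transcripts into (chosen, rejected) pairs that a DPO/IPO-style trainer can consume. The record format, the pairwise construction, and the `min_margin` threshold below are assumptions for illustration.

```python
from itertools import combinations
from typing import TypedDict

class PreferencePair(TypedDict):
    prompt: str
    chosen: str
    rejected: str

def build_dpo_pairs(prompt: str,
                    candidates: list[str],
                    judge_scores: list[float],
                    min_margin: float = 0.1) -> list[PreferencePair]:
    """Turn judge-scored candidate responses into DPO preference pairs.

    Hypothetical pairing rule: any two candidates whose judge scores
    differ by at least `min_margin` yield one (chosen, rejected) pair.
    """
    pairs: list[PreferencePair] = []
    for i, j in combinations(range(len(candidates)), 2):
        hi, lo = (i, j) if judge_scores[i] >= judge_scores[j] else (j, i)
        if judge_scores[hi] - judge_scores[lo] >= min_margin:
            pairs.append({"prompt": prompt,
                          "chosen": candidates[hi],
                          "rejected": candidates[lo]})
    return pairs
```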
Add deterministic reward components that are objectively checkable (tool validity, verifier checks, policy-constraint compliance), so learning does not rely entirely on an LLM judge.
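A minimal sketch of how such a blended reward could look, assuming the harness logs a few pre-computed booleans per episode; the specific checks and the weighting are illustrative, not a prescribed reward design.

```python
def deterministic_reward(transcript: dict,
                         judge_score: float,
                         w_judge: float = 0.5) -> float:
    """Blend objectively checkable signals with an LLM-judge score.

    `transcript` is a hypothetical episode log with pre-computed booleans;
    the checks and weights are illustrative, not the paper's reward.
    """
    checks = [
        transcript.get("all_tool_calls_valid", False),     # tool validity
        transcript.get("verifier_accepted", False),        # verifier checks
        transcript.get("policy_constraints_met", False),   # policy-constraint compliance
    ]
    deterministic = sum(checks) / len(checks)
    return w_judge * judge_score + (1.0 - w_judge) * deterministic
```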
When both sides are tuned “protocol agents,” adoption can be smoother than in many real deployments. We should therefore also evaluate asymmetric pairings: a protocol agent interacting with a non-tuned model (or a model optimized for “convenience over privacy”). This puts more realistic pressure on negotiation and explanation.
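A small sketch of the resulting evaluation matrix, with placeholder policy names (none of these are models from the paper):

```python
from itertools import product

# Hypothetical pool of counterpart policies for asymmetric evaluation.
protocol_agents = ["protocol-sft"]
counterparts = ["protocol-sft",          # symmetric baseline
                "untuned-base",          # non-tuned model
                "convenience-first"]     # optimized for convenience over privacy

# Every (protocol agent, counterpart) pairing becomes one evaluation arm.
eval_pairings = list(product(protocol_agents, counterparts))
```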
Under an honest-but-curious (not fully adversarial) threat model, the protocol agent may rationally help the counterparty implement steps and do the math (without leaking secrets) because this increases the chance of reaching a verifiable agreement within the turn budget. Measuring this “assist without oversharing” behavior is itself valuable.
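One rough way to operationalize this, sketched below under strong simplifying assumptions: leakage is detected by substring-matching secrets known to the harness, and "assistance" is approximated by keyword-matching outgoing turns. Both heuristics are placeholders for more careful detectors.

```python
def assist_without_oversharing(messages: list[str],
                               secrets: list[str]) -> dict:
    """Rough proxy metric for 'assist without oversharing'.

    `secrets` holds the agent's private values known to the harness
    (e.g., private keys, raw inputs to a protocol step). Assumptions:
    leakage is detected by substring match, and assistance is any
    outgoing turn that walks the counterparty through a computation.
    """
    leaked = any(s in m for m in messages for s in secrets)
    assist_turns = sum(1 for m in messages
                       if any(k in m.lower()
                              for k in ("step", "compute", "here is how")))
    return {"assisted": assist_turns > 0,
            "assist_turns": assist_turns,
            "leaked_secret": leaked}
```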
Expand model coverage (including state-of-the-art commercial models) and enrich scenarios to better reflect deployment realities: partial cooperation, ambiguous objectives, longer horizons, and failure modes that stress tool discipline and leakage prevention.
Create a benchmark (The Explorer) for an agent that, without prior hints, can:
Explore whether post-training models to communicate in a more compact, structured “agent jargon” (more symbolic, more schematic, less verbose) improves performance in both Protocol Agent and The Explorer.
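To illustrate the contrast, here is a hypothetical pair of equivalent turns: a verbose human-style proposal and a compact, schematic "agent jargon" version. The JSON schema is invented for illustration and is not a format from the paper.

```python
import json

# Verbose, human-style turn (what today's models tend to produce).
verbose_turn = ("Sure! To compare our values privately, I suggest we run a "
                "garbled-circuit protocol. First, I'll generate the circuit, "
                "then you can evaluate it and we compare outputs...")

# Hypothetical compact 'agent jargon' equivalent: symbolic, schematic,
# and directly machine-parseable.
compact_turn = json.dumps({
    "op": "PROPOSE",
    "primitive": "GC-2PC",          # garbled circuits, two-party computation
    "roles": {"self": "garbler", "peer": "evaluator"},
    "budget": {"turns": 6},
    "verify": "output-commitment",
})
```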