Ephemerent Ephemeral · Emergent
Research agenda  —  R1–R7, on the record

Seven directions, three frontiers, one question.

Ephemerent studies how useful intelligence emerges, briefly, from systems of small agents — many minds spun up for a moment, cooperating, judged, and gone, leaving only the result. Our seven research directions are not seven bets; they are three fundable frontiers, each with a product that expresses it and a public artifact that proves it. The discipline that keeps the lab honest is one line: LLMs now, layers later. We ship LLM-powered, legible coding agents today, and research the layers that integrate alongside them — never as a replacement, always as a measured addition with an eval that proves it earned its place.

Directions: R1–R7
Frontiers: Coordination · Verification · World models
Discipline: LLMs now, layers later
Horizon: Eight quarters · 2026–2028
Frontier A

Emergent multi-agent coordination.

The open gap: today's interoperability protocols assume the problem is already solved. A2A v1.0 and MCP both encode static roles and pairwise hand-offs; neither has a framework for spontaneous collaboration, distributed consensus, or emergent task-sharing. Gossip-protocol-style decentralized coordination is a named open gap. For a lab called Ephemerent, this is the natural home and the most distinctive claim we can stake. Over eight quarters we will instrument Orrery's existing parallel-worktree runs as coordination traces, prototype a gossip-style consensus layer in which agents exchange intermediate diffs and converge without a central judge, and extend that layer to Colony at fleet scale. The credible public output: an open multi-agent coordination dataset (anonymized Orrery traces) and a tech report on gossip vs. central dispatch. Gossip coordination is marked as a bet: advisory-first, and promoted only when it beats central dispatch on the benchmark.

R1 Emergent orchestration Decomposing a goal into parallel agents, running each in isolation, and merging only what works. Orrery already does this with a central dispatcher; the research question is what coordination looks like with no dispatcher at all — agents that share partial results, vote, and re-allocate work via gossip. Product: Colony.
R7 Distributed compute mesh A hive-style network where contributors parallelize training and inference across their own GPUs — async gradient averaging, fault-tolerant participation, credit-backed rewards — so coordination is tested at fleet scale, not just on one laptop. Datacenter-scale ambition without datacenter monopoly. Product: Colony (marked as a bet).
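
To make the gossip idea concrete, here is a minimal sketch of push-pull gossip in which agents converge on one candidate with no dispatcher. It is an illustration under stated assumptions, not Orrery's or Colony's implementation: the Candidate type, the tests_passed score, and the tie-break on diff_id are all hypothetical names chosen for the example.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    diff_id: str        # hypothetical identifier for an intermediate diff
    tests_passed: int   # hypothetical score from executing that diff's tests


def better(a, b):
    """Total order over candidates: more tests passed wins, diff_id breaks ties."""
    return max(a, b, key=lambda c: (c.tests_passed, c.diff_id))


def gossip_round(views, rng):
    """Each agent contacts one random peer; both keep the better candidate."""
    n = len(views)
    for i in range(n):
        j = rng.randrange(n - 1)
        if j >= i:          # shift past i so the peer is always another agent
            j += 1
        views[i] = views[j] = better(views[i], views[j])


def run_gossip(initial, seed=0, max_rounds=50):
    """Return (agreed candidate or None, rounds used). No central judge is consulted."""
    rng = random.Random(seed)
    views = list(initial)
    for rounds in range(max_rounds + 1):
        if len(set(views)) == 1:    # every agent holds the same candidate
            return views[0], rounds
        if rounds < max_rounds:
            gossip_round(views, rng)
    return None, max_rounds


agents = [
    Candidate("a", 4), Candidate("b", 7), Candidate("c", 9),
    Candidate("d", 7), Candidate("e", 2),
]
winner, rounds = run_gossip(agents)
```

Because every exchange keeps the better of two candidates, the best candidate is never lost and only spreads. The open research question is whether this kind of convergence still beats central dispatch once scores are noisy and agents join and leave mid-run.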
Frontier B

Verification & evaluation as the hard problem.

The open gap: capability is saturating. Opus-class models reach ~88% on SWE-bench Verified, so value migrates from generating answers to trusting them. Yet only ~52% of production teams run evals at all; the lab-to-production gap runs ~37%; and there is a documented 50× cost variance for equal accuracy. The weakest-covered cases are exactly ours — multi-agent, long-horizon, self-improving. Over eight quarters Arbiter becomes the through-line: we ship execution-based scoring as Orrery's default judge (winners chosen by tests passed, not by an LLM eyeballing a diff), build a public eval harness with both step-level and outcome-level scoring, and stand up a drift leaderboard that re-measures solve-rate as cloud models change underneath us. Vellum is the spatial expression — replacing theatrical proofs with a real WASM software rasterizer so "prove, don't eyeball" is literally true, with no GPU in the loop. These are commitments, not bets: extensions of work already running in Orrery and Vellum. The credible public output: a public eval harness (step + outcome), a multi-agent verification benchmark anyone can run, and a short paper on execution-based selection vs. LLM-judge.

R2 Verifiable selection Choosing the best attempt by execution, not impression — panels, metrics, and judges instead of guesswork, with step-level scoring (was each split, tool-call, and hand-off sound?) and outcome-level scoring (did the merged result work?). Products: Arbiter (the verification layer) and Vellum (spatial verification).
Frontier C

World models as a planning substrate.

The open gap: world models went mainstream in 2026, with Genie 3, World Labs / Marble, and LeCun's AMI direction. For code, the analogue is an agent that simulates execution and repo state before acting. That makes it both a planning primitive and a verification primitive: predict the consequence of a patch, then prune the bad branches before spending tokens on them. Over eight quarters Seed is the bet, and we mark it as a bet. We will collect patch-consequence data from Orrery runs (proposed diff → test/build outcome), train a small predictor that estimates a patch's outcome before execution and use it to prune best-of-N candidates, and explore latent and multimodal reasoning, plus spatial grounding where Vellum's geometry gives a signal. Seed runs advisory-only first: it ranks candidates; the LLM and Arbiter still decide. A pre-execution patch predictor lands behind a flag by Q1 2028, integrated alongside (never replacing) the LLM agents, and is promoted only if it improves solve-rate-per-token on Arbiter's harness. The credible public output: an open patch-consequence dataset and a tech report on pre-execution pruning. R6 (wave/RF/photonic analog compute) is the longest-horizon bet and stays explicitly research-only across this window.
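
A minimal sketch of pre-execution pruning, assuming a predictor that returns an estimated pass probability for each patch. Everything here is hypothetical: the Patch fields, the toy_predictor heuristic (a stand-in for a trained model, not Seed), and the prune helper. The sketch is advisory in the sense described above: it only ranks and splits candidates, and returns the pruned ones rather than discarding them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Patch:
    patch_id: str
    lines_changed: int    # hypothetical feature
    touches_tests: bool   # hypothetical feature


def toy_predictor(patch: Patch) -> float:
    """Stand-in for a trained model: estimated probability the patch passes."""
    score = 0.9 - 0.01 * patch.lines_changed
    if patch.touches_tests:
        score -= 0.2      # assumed penalty, purely illustrative
    return max(0.0, min(1.0, score))


def prune(patches, predictor, keep):
    """Rank best-of-N candidates by predicted outcome; return (kept, pruned)."""
    ranked = sorted(patches, key=predictor, reverse=True)
    return ranked[:keep], ranked[keep:]


patches = [
    Patch("p1", lines_changed=10, touches_tests=False),
    Patch("p2", lines_changed=60, touches_tests=False),
    Patch("p3", lines_changed=5, touches_tests=True),
]
kept, pruned = prune(patches, toy_predictor, keep=2)
```

Only the kept candidates would go on to real execution, which is where the token savings would come from. Whether the savings outweigh the cost of wrongly pruning a good patch is exactly what the harness would need to measure.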

R3 Latent world models Computational state — repos, runtimes, proof states — encoded and rolled forward in embedding space, scoring candidates and imagining outcomes before real execution. Product: Seed (marked as a bet).
R4 Multimodal latent reasoning Text, code, and vision sharing a representation space for planning and verification, alongside the LLMs that still emit the final code and proofs. Probes are scheduled mid-window, and the layer is integrated only when measured to earn its place.
R5 Spatial & embodied agents Agents that build and check things you can see (shaders, parts, and scenes), verified without a GPU. Vellum's geometry gives a grounded signal, with visual context as a future layer on the same agent stack.
R6 Wave compute substrates RF and photonic analog accelerators — matrix operations in electromagnetic waves instead of shuttled electrons. Algorithm–hardware co-design, not bigger GPUs alone. The lab's longest-horizon bet; explicitly research-only across this window.
On the record

What is shipped versus what is a bet.

To stay honest: execution-based verification, the eval harness, the drift leaderboard, and Vellum's real rasterizer are commitments. They extend work already running in Orrery and Vellum, and they fund themselves by making the product more trustworthy. Gossip-style emergent coordination, Colony at fleet scale, Seed's software world model, and every line of R6 are bets. Each is flagged and advisory-first, and is promoted only when an Arbiter eval proves it beats its baseline: central dispatch for gossip coordination, and the LLM-only agents on solve-rate-per-token for Seed. R6 stays research-only across this window. Nothing replaces the LLM. Each layer is added when, and only when, it is measured to earn its place. That is the whole discipline: layers later, but only the layers that pass the test.
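
The promotion rule can be written down as a small gate. This is a sketch under assumptions, not Arbiter's definition: we take solve-rate-per-token to mean solve rate divided by total tokens spent on the same task set, and the EvalRun fields, the min_gain parameter, and the numbers below are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalRun:
    name: str
    tasks: int     # tasks attempted
    solved: int    # tasks solved, judged by execution
    tokens: int    # total tokens spent across the run


def solve_rate_per_token(run: EvalRun) -> float:
    """Assumed definition: (solved / tasks) / tokens."""
    if run.tasks <= 0 or run.tokens <= 0:
        raise ValueError("run must have at least one task and one token")
    return (run.solved / run.tasks) / run.tokens


def should_promote(baseline: EvalRun, candidate: EvalRun, min_gain: float = 0.0) -> bool:
    """Promote only if the candidate layer beats the baseline on the same task set."""
    if baseline.tasks != candidate.tasks:
        raise ValueError("runs must cover the same number of tasks")
    required = solve_rate_per_token(baseline) * (1.0 + min_gain)
    return solve_rate_per_token(candidate) > required


# Illustrative numbers only, not measured results.
llm_only = EvalRun("llm-only", tasks=100, solved=60, tokens=2_000_000)
with_layer = EvalRun("llm+layer", tasks=100, solved=62, tokens=1_500_000)
costly_layer = EvalRun("llm+costly-layer", tasks=100, solved=61, tokens=2_100_000)
```

With these made-up numbers, should_promote(llm_only, with_layer) is true and should_promote(llm_only, costly_layer) is false: the second layer solves one more task but spends enough extra tokens to lose on the metric.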