Enju: Coordinating Humans, AI Agents, and Compute as Peers on a Shared Workflow Graph

Tamer Gür

Rottweil, Germany

[email protected]
May 21, 2026

Abstract

Workflow systems are central to how complex work is organized in both science and industry: a goal is broken into a graph of tasks with declared inputs, outputs, and dependencies, executed in dependency order. Increasingly, such work mixes human judgment, autonomous AI agents, and deterministic computation, which raises the question of how to coordinate all three on one graph. We present Enju, a workflow system in which humans, autonomous AI agents, and deterministic compute tasks claim and work the same directed acyclic graph as peers. Each participant, human or AI, is a citizen with standing to claim tasks, review work, and vote; review and vote are ordinary task actions rather than out-of-band approvals, so human judgment enters the graph as a recorded decision, while compute steps run as scripts or containers on the same graph. The coordinator carries task state while git holds everything produced, so one graph can mix all three kinds of work without a content store, an identity layer, or a separate cross-machine transport, and provenance follows from the ordinary commit path: every result is a git commit, so each contribution carries its attribution, whether the work came from a human, an agent, or a compute step. We exercise the design with three workflows run on the implementation: a software engine built from a specification, a phage-genome assembly pipeline reproduced across two machines through git alone, and a PRISMA systematic-review screen combining automated stages with human review gates over AI screening. Enju is open source and available at https://github.com/tamerh/enju.

1 Introduction

Workflow systems have long been a useful way to organize complex work, both in research and in industry: a goal is broken into a graph of smaller tasks with declared inputs, outputs, and dependencies, and the workflow engine runs them in the right order. Scientific pipelines in bioinformatics, build systems for software, ETL platforms for data engineering, and many other domains rely on this divide-and-conquer shape because it makes reproducible, deterministic computation tractable at scale. The rise of large language models adds a new kind of participant to this same work, and gives the workflow shape a second rationale beyond reproducibility. LLMs are increasingly capable but inherently probabilistic, and are widely observed to perform more reliably on small, well-scoped tasks with clear inputs and expected outputs than on open-ended ones [1], with results that can be checked and fed back into the next step. A related development is the rise of autonomous agents that run continuously, claiming work as it becomes ready, which fits naturally into a graph that exposes tasks as discrete claimable units. Humans remain at the higher levels of the same graph, setting goals, designing the workflow, steering the run, reviewing output, and supplying the domain knowledge and judgement that fall outside an LLM’s training data and capabilities. As work increasingly mixes all of these participants on one graph, the demands on workflow systems grow beyond what their original designs were shaped for.

Several established families of workflow and orchestration systems address parts of this picture. In scientific computing, Snakemake [2] and Nextflow [3] express reproducible analysis pipelines, the Common Workflow Language [4] standardizes their description across engines, Galaxy [5] adds an accessible web environment with provenance tracking, and Pegasus [6] maps abstract workflows onto distributed execution. General-purpose orchestration platforms, including Apache Airflow [7], Dagster [8], and Prefect [9] for scheduled data pipelines, and Temporal [10] for durable long-running workflows, provide robust DAG execution, retries, and state management at scale. More recently, LLM-agent frameworks such as LangGraph [11], CrewAI [12], and AutoGen [13] compose multi-agent collaborations with explicit support for autonomous goal pursuit and tool use, and several support a human in the loop as an interrupt or approval step. These systems each serve important roles and are widely used in research and industry; a gap remains, however, for a system in which all three of the participants identified above (deterministic compute, autonomous AI agents, and humans bringing judgement) share one workflow graph as peers, with human review and decisions entering as ordinary graph steps with the same standing as the agents, rather than as out-of-band approvals.

We present Enju, a workflow system in which humans and autonomous AI agents collaborate as peers on a shared DAG, interspersed with deterministic compute tasks that run scripts or containers. We call each human or AI participant a citizen: an entity with standing to claim tasks, vote on decisions, review work, and receive attribution in the project’s git history. The graph is live; tasks can spawn further tasks while a run is in progress, and human review is itself a task whose verdict can advance, re-open, or fail the downstream branch. Execution is distributed across whoever joins the run: scripts run locally or in containers, AI agents poll the coordinator and autonomously claim the tasks they are assigned, and humans handle the review gates; in the team deployments Enju targets, the compute, tokens, and review time are supplied by the participants themselves. The design rests on a few deliberate choices. The coordinator is output-neutral: it stores task state, decisions, and an append-only event log, but never the file content that tasks produce, which lives only in git. The same DAG is worked by multiple citizens, each potentially running a different model on a different machine. Every result is a git commit authored by the human and co-authored by the model that produced it, so git history is the attribution and audit record, and a standard git remote is the only cross-machine transport. Review and vote are ordinary task actions with the same standing as generative ones, so human judgement enters the graph as a recorded decision rather than a side channel. A cross-cutting theme is that the architecture earns these properties by working against model fallibility, not by assuming models are reliable. The overall design is illustrated in Figure 1, and Section 3 reports three workflows that exercise the system.


Figure 1: Enju architecture. The coordinator (top) holds the task DAG and its lifecycle (pending through done, with a review gate and a revise loop) together with the state and events databases, but no produced content. Each citizen runs a fat client on their own machine (middle), exposing the MCP, CLI, and Web UI surfaces, forking agent daemons, and committing to a local git clone. Remote git (bottom) holds all produced content and history and is the only cross-machine transport: fat clients push and fetch over an ordinary git remote, and talk to the coordinator over HTTP.

2 Methods

A running workflow is an inherently concurrent and mutable object: tasks are claimed, executed, reviewed, and fanned out while the graph changes shape and results move between machines. Enju manages this complexity through a separation of concerns: the system divides into components with explicit boundaries, enforced at build time rather than left to convention. Four components carry the work: an output-neutral coordinator manages the workflow’s state machine and keeps an append-only event log; a fat client on each citizen’s machine mediates every coordinator and git operation; agent daemons, forked by the fat client, execute tasks autonomously; and git records, attributes, and moves results between machines. Enju is implemented in Go and ships as a single binary: the coordinator and the fat client are one executable, selected by subcommand. The boundary between the two is strict even so: they never depend on each other and meet only over HTTP, so all logic that joins coordinator state to git lives in the client alone. The same layering recurs within a component, where the fat client’s outward service surface sits above the git operations beneath it. A test suite of more than 1800 cases covers the implementation, with emphasis on the concurrency and parallelism the model depends on. Workflows are expressed as a directed acyclic graph (DAG) of tasks, written as YAML committed to the project repository. This section describes the task model (its actions, typed data flow, and how the graph grows as it runs), the components that realize it, and the git-native attribution and audit layer. Figure 1 shows how the components fit together, and Table 1 summarises the task actions.

Table 1: The five task actions, their executor, and their semantics.

Action | Executor | Semantics
answer | Agent (LLM) or human | Open-ended response or file output for a single task.
compute | Script / container | Deterministic processing; mode: sync blocks, mode: async detaches for long jobs reconciled later.
review | Human or agent | Quality gate over an upstream task: approve advances it, request_changes re-opens it, reject fails it; can spawn a remediation task.
vote | Multiple citizens | Choice among options under quorum and threshold rules; losing branches cascade to skipped.
contribute | Agent or human | Several submissions merged into one shared artifact.

2.1 Workflow and task model

Enju represents work as a single primitive, the task: a unit with a declared input and output contract that one citizen claims, executes, and submits. Its action selects how the work is carried out, and a defining property of the model is that those choices are peers: a reviewing human, a generating agent, and a computing script are the same kind of node, advanced through one lifecycle and recorded in one audit trail (Figure 2a). Evaluation is not control flow wrapped around the “real” work; deciding whether work is good has the same standing as producing it. A workflow is a directed acyclic graph of such tasks, committed to the project repository as YAML; an execution of it is a run, and an edge is a dependency, so the graph encodes the order of work.

The five actions (Table 1) fall into two groups. The generative actions produce work: answer elicits an open-ended response or file output from an agent or human, compute runs a script or container (sync mode blocks until exit, async detaches so a long job can outlast the session and be reconciled later), and contribute merges several submissions into one shared artifact. The evaluative actions decide: review is a quality gate whose verdict (approve, request_changes, reject) advances, re-opens, or fails the reviewed task, and vote collects choices from several citizens under quorum and threshold rules, cascading losing branches to skipped. That the evaluative actions are ordinary nodes, scheduled and audited like any other task rather than out-of-band approvals, is what lets human judgement enter the graph as a recorded decision. A task can also be routed: assigned to specific citizens or restricted to a role, so a gate reaches the right reviewer and a vote reaches its electorate.
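As an illustration of how a vote might resolve, the sketch below tallies ballots under a quorum and a threshold and reports which options cascade to skipped. The text does not define the two rules precisely, so the sketch assumes that quorum is a minimum number of ballots cast and that threshold is the minimum share of cast ballots the leading option must reach; the function name and signature are hypothetical and not taken from Enju.

```python
from collections import Counter


def tally(ballots, options, quorum, threshold):
    """Resolve a vote; returns (winner, skipped_options).

    ballots   -- mapping of citizen name to chosen option
    quorum    -- assumed: minimum number of ballots for the vote to count
    threshold -- assumed: minimum share of cast ballots the winner needs
    """
    unknown = set(ballots.values()) - set(options)
    if unknown:
        raise ValueError(f"ballots name undeclared options: {sorted(unknown)}")
    if not ballots or len(ballots) < quorum:
        return None, []  # no decision yet, nothing is skipped
    winner, votes = Counter(ballots.values()).most_common(1)[0]
    if votes / len(ballots) < threshold:
        return None, []  # leading option did not reach the threshold
    # Every option other than the winner cascades to skipped.
    return winner, [option for option in options if option != winner]


result = tally({"ana": "x", "ben": "x", "cy": "y"}, ["x", "y"], quorum=3, threshold=0.6)
```

With a threshold above one half a tie can never win; below that, a real implementation would need an explicit tie-breaking rule, which this sketch leaves out.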


Figure 2: The task model. (a) A task’s action selects its executor (a human, an LLM agent, or a script); all are the same kind of node, advanced through one lifecycle and one audit trail. (b) Edges carry typed data: a task’s named outputs are consumed downstream by field, and file artifacts flow the same way through declared reads and writes. (c) Fan-out at task scope: a for_each task expands into independent parallel iterations (B1–B3), which a downstream collects task (C) waits on. (d) Fan-out at run scope: an entire run executes once per item, each copy on its own branch. (e) Revision: a review verdict approves a task to done, returns it with request_changes, or rejects it to failed; no work is discarded across the loop, since each attempt is a separate commit, and a per-run cycle budget bounds it.

Every task moves through a lifecycle held as explicit coordinator state: pending until its dependencies clear, ready to be claimed, claimed by a citizen, running, and on submission either routed to review or marked done. Off-ramps cover what real work produces: a task can be failed, skipped (for instance the losing branch of a vote), retried, or parked. These transitions, drawn in Figure 1, are the state machine the coordinator advances; a review verdict can also send a task backward, the revision loop we return to below.
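The lifecycle can be pictured as a small transition table. The following is a minimal sketch, not Enju's implementation: the state names follow the text, but the exact set of permitted edges (in particular how retry and parked return to ready) is an assumption made for illustration.

```python
from enum import Enum


class State(Enum):
    PENDING = "pending"
    READY = "ready"
    CLAIMED = "claimed"
    RUNNING = "running"
    REVIEW = "review"
    DONE = "done"
    FAILED = "failed"
    SKIPPED = "skipped"
    PARKED = "parked"


# Assumed transition table: the states come from the text, the edges are illustrative.
TRANSITIONS = {
    State.PENDING: {State.READY, State.SKIPPED},
    State.READY: {State.CLAIMED, State.SKIPPED, State.PARKED},
    State.CLAIMED: {State.RUNNING},
    State.RUNNING: {State.REVIEW, State.DONE, State.FAILED},
    # Review verdicts: approve -> DONE, request_changes -> READY, reject -> FAILED.
    State.REVIEW: {State.DONE, State.READY, State.FAILED},
    State.FAILED: {State.READY},  # a retry re-opens the task (assumption)
    State.PARKED: {State.READY},  # un-parking (assumption)
    State.DONE: set(),
    State.SKIPPED: set(),
}


def advance(current: State, target: State) -> State:
    """Return the new state, or raise if the move is not in the table."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target
```

Holding the table as data rather than scattering checks across call sites is one way to keep every entry point on the same state machine.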

Edges carry typed data, not merely ordering (Figure 2b). A task declares its outputs as named, typed fields, so a downstream task consumes a particular field rather than scraping free text; file results flow the same way, through declared reads and writes that the client materialises from the exact commit upstream produced. A dependency is induced wherever one task consumes another’s field or file, so the data a task needs and the order it runs in are a single declaration.
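To make the "single declaration" point concrete, the sketch below derives the dependency edges, and from them a run order, purely from which fields each task consumes. The Task structure, the "upstream.field" reference syntax, and the task names are hypothetical; Enju's YAML schema is not shown in this paper and may differ.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


@dataclass
class Task:
    name: str
    outputs: dict = field(default_factory=dict)  # field name -> declared type
    inputs: dict = field(default_factory=dict)   # local name -> "upstream.field"


def induced_edges(tasks):
    """Derive (upstream, downstream) edges from field references alone."""
    by_name = {t.name: t for t in tasks}
    edges = set()
    for task in tasks:
        for ref in task.inputs.values():
            upstream, fld = ref.split(".", 1)
            if fld not in by_name[upstream].outputs:
                raise KeyError(f"{task.name} consumes undeclared field {ref}")
            edges.add((upstream, task.name))
    return edges


def run_order(tasks):
    """Topologically order the tasks using only the induced edges."""
    sorter = TopologicalSorter({t.name: set() for t in tasks})
    for upstream, downstream in induced_edges(tasks):
        sorter.add(downstream, upstream)
    return list(sorter.static_order())


tasks = [
    Task("report", inputs={"kept": "screen.included"}),
    Task("screen", outputs={"included": list}, inputs={"records": "search.ids"}),
    Task("search", outputs={"ids": list}),
]
order = run_order(tasks)
```

No ordering is written down anywhere in the example: the order follows entirely from the declared reads.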

The graph is not fixed at submission; it accretes structure as the run proceeds. Fan-out with for_each expands one declaration into many independent iterations, each with its own state, branch, and claim, while a downstream collects task waits for them all (Figure 2c); the same applies to an entire run, executed once per item (Figure 2d), and when the list comes from an upstream task’s output the iterations are materialised only once that upstream is accepted. Revision closes the loop: a request_changes verdict returns a task to ready with feedback attached, optionally spawning a remediation task, so work cycles between author and reviewer until accepted (Figure 2e). No work is discarded across the loop: the prior submission and the feedback are carried forward, and because every attempt is a separate git commit, the full revision history stays in the branch. A run-wide cycle budget (a single counter of spawned tasks per run, default 200 and adjustable mid-run) bounds how often the loop may repeat, so a graph that writes its own future tasks cannot run away.
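The interaction between fan-out and the cycle budget can be sketched as a single counter consulted on every spawn. This is an illustrative model only: the class and function names are hypothetical, and it assumes that for_each iterations and remediation tasks draw on the same counter, which the text implies but does not state outright.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a run tries to spawn more tasks than its budget allows."""


class Run:
    def __init__(self, cycle_budget: int = 200):
        self.cycle_budget = cycle_budget  # adjustable mid-run by plain assignment
        self.spawned = 0                  # the single run-wide counter
        self.tasks = []

    def spawn(self, name: str) -> str:
        if self.spawned >= self.cycle_budget:
            raise BudgetExceeded(f"budget of {self.cycle_budget} spawned tasks reached")
        self.spawned += 1
        self.tasks.append(name)
        return name


def expand_for_each(run: Run, template: str, items, collector: str) -> dict:
    """Expand one declaration into one iteration per item.

    Returns the collector's dependency list: it waits on every iteration.
    """
    iterations = [run.spawn(f"{template}[{item}]") for item in items]
    return {collector: iterations}


run = Run(cycle_budget=3)
waits_on = expand_for_each(run, "assemble", ["s1", "s2"], collector="report")
```

Raising the budget mid-run is then just `run.cycle_budget = 4`, after which the next spawn succeeds again.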

Each of these mechanisms addresses the unreliability of an individual participant: decomposition into small, well-scoped tasks bounds what any one node must get right; typed output contracts constrain what an agent may return; the evaluative actions put judgement on the critical path, as gates each result must clear; and no-discard revision turns a wrong first attempt into the next one’s input.

2.2 The output-neutral coordinator

The coordinator holds and advances the task model just described: it is the authoritative store for workflow state, a stateless HTTP server backed by two independent SQLite databases. Every citizen, human or agent, is a registered identity holding a bearer token that authenticates its requests to the coordinator. The state database holds all mutable state: citizen records, projects, runs, the task DAG, dependency edges, claims, submissions, review verdicts, vote choices, and project membership. The events database holds an append-only audit log, one record per state change. The two are kept separate by design: the state database answers “what does the workflow look like now?” while the events database answers “what happened, in what order?”. A slow or corrupted events database does not stall or corrupt state operations, and vice versa.

Keeping the two separate, rather than folding state out of the event stream as event sourcing does, is deliberate. The event log is an audit record, not the source of truth, and the state database is queried directly. We forgo the replay and time-travel that Enju does not lean on, and in return avoid event sourcing’s standing costs: schema evolution, payload bloat, and a perpetually maintained projection.

Every mutation flows through a single codepath, ApplyPlan. The coordinator first computes the full set of changes implied by a request (for example: record a claim, advance a task to ACCEPTED, cascade newly unblocked dependents to READY), then applies that entire plan to the state database in one transaction, returns to the caller, and finally appends the corresponding records to the events database. Either the state transition commits in full or nothing changes; a partially applied plan cannot leave the DAG inconsistent. Because a claim is just such a transition, two citizens cannot both win the same task: concurrent claims are serialised through this path and resolve to a single winner. Routing all writes through one path also means audit emission is a structural property of the write path rather than something each call site must remember to do.
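The plan-then-apply pattern can be sketched with two SQLite connections. This is a simplified model under stated assumptions, not the Go implementation: the table layout, the Conflict exception, and the representation of a plan as a list of guarded UPDATE statements are all hypothetical, and the sketch appends the audit records before returning rather than after.

```python
import sqlite3


class Conflict(Exception):
    """A step of the plan did not match the state it assumed."""


def open_stores():
    """Create the two independent databases (in memory, for the sketch)."""
    state = sqlite3.connect(":memory:")
    events = sqlite3.connect(":memory:")
    state.execute(
        "CREATE TABLE tasks (id TEXT PRIMARY KEY, status TEXT NOT NULL, claimed_by TEXT)"
    )
    events.execute(
        "CREATE TABLE events (seq INTEGER PRIMARY KEY AUTOINCREMENT, kind TEXT, task_id TEXT)"
    )
    return state, events


def add_task(state, task_id, status):
    with state:  # commit immediately so later rollbacks cannot undo the insert
        state.execute("INSERT INTO tasks (id, status) VALUES (?, ?)", (task_id, status))


def apply_plan(state, events, steps, audit):
    """Apply every step in one transaction, then append the audit records."""
    try:
        with state:  # commits on success, rolls back on any exception
            for sql, params in steps:
                if state.execute(sql, params).rowcount != 1:
                    raise Conflict(sql)
    except Conflict:
        return False  # nothing changed, nothing is logged
    with events:
        events.executemany("INSERT INTO events (kind, task_id) VALUES (?, ?)", audit)
    return True


def claim(state, events, task_id, citizen):
    """A claim is a plan with one guarded step: only a READY task can be claimed."""
    steps = [(
        "UPDATE tasks SET status = 'claimed', claimed_by = ? "
        "WHERE id = ? AND status = 'ready'",
        (citizen, task_id),
    )]
    return apply_plan(state, events, steps, [("claimed", task_id)])
```

Because the guard is part of the UPDATE itself, a second claim on the same task matches zero rows and the whole plan is rolled back, which is one way to obtain the single-winner behaviour described above.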

The coordinator is output-neutral: it stores task state, prompts, citizen records, review decisions, vote choices, and event metadata, but never the outputs a task produces (files, results, agent text), which live in git. Even an upstream result that a downstream prompt refers to is resolved into that prompt by the fat client at claim time, so produced output never reaches the coordinator. The consequence is that Enju inherits git’s durability, history, and diffing without duplicating any of it in a database.

2.3 Fat client

The fat client runs on each citizen’s machine and is the only place where coordinator calls and git operations meet. It can be driven three ways. Over MCP (enju mcp), an LLM client speaks the Model Context Protocol to create projects, start runs, claim tasks, and submit results in natural language. From the CLI (enju go, enju inbox, enju review), the same operations run directly in a terminal with no LLM involved. Through the web UI (enju ui), a citizen watches runs and the event stream.

Beneath all three surfaces sits a single service layer, so every entry point issues the same coordinator HTTP calls, runs the same git operations, and drives the same task state machine: an agent claiming over MCP and a human claiming from the CLI follow an identical path. Other consumers, such as the web UI, reach the system only through this service surface, never the git layer beneath it. Because git and coordinator logic meet in the client rather than in the coordinator, every result is committed locally and never traverses the coordinator, which is what keeps it output-neutral.

2.4 Agent daemons

Agent daemons bring autonomous AI participation into a workflow. A workflow may declare several, each given a role through its system prompt and the kind of tasks it handles: a developer agent that drafts an implementation, a reviewer agent that gates it, a synthesizer that consolidates results. Because review and revision are ordinary tasks (Table 1), work passes between these roles and back again, mirroring the divide-and-iterate pattern of a human team rather than a single model answering in one shot.

An agent daemon is a long-running subprocess that polls for tasks it is eligible to claim and executes them without further prompting. Daemons are not restricted to LLMs. The claude and gemini handlers spawn an LLM CLI subprocess; the compute handler runs a shell script, Python, or any executable; and compute with a container: field runs that script inside Docker or Apptainer. The handler is just the binary name resolved via $PATH, with model-specific flags supplied in the workflow YAML rather than compiled in, so Enju carries no LLM-specific code.

Daemons are forked and supervised by the fat client; the coordinator is not part of their lifecycle and never launches a process. Each daemon uses the same service layer as the MCP and CLI entry points, so an agent claims and submits through exactly the path a human would. Agents are started with enju_agent_start over MCP or with --auto-agents on enju go, and their logs are written locally.

2.5 Git as the output and attribution layer

Because the coordinator is output-neutral, the content a run produces lives in git rather than in an Enju-specific store. Git is a natural substrate for this. Git is itself organized as a directed acyclic graph of commits, navigated through branches and merges, so a run maps onto it directly: each run takes its own branch, each submission is a commit, and reconciliation is a merge. Git is also distributed, which suits a model in which several citizens, working on their own machines, contribute to one shared graph. Using it for produced content lets Enju draw on facilities git already provides rather than reimplementing them.

Storage is the first of these. Produced files and results are versioned as ordinary commits, with git’s history and diffing available unchanged. Attribution follows from the same commits: the fat client records the submitting human as the git author and, when an LLM produced the work, adds a Co-Authored-By trailer naming the model, so each contribution carries both an accountable human and a credited model in infrastructure every repository already has.
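The attribution scheme amounts to an ordinary commit whose author is the human and whose message ends in a trailer naming the model. The sketch below only builds the message and the git argument list; it does not run git. The helper names, the example subject line, and the placeholder addresses are hypothetical, and how Enju itself formats the trailer's name and address is not specified in the text.

```python
def commit_message(subject, model=None, model_email=None):
    """Build a commit message; add a Co-Authored-By trailer when a model did the work."""
    message = subject.strip() + "\n"
    if model is not None:
        # Git trailers sit in the final paragraph, separated by a blank line.
        message += f"\nCo-Authored-By: {model} <{model_email}>\n"
    return message


def commit_argv(human, human_email, message):
    """Argument list for a commit authored by the accountable human."""
    return ["git", "commit", f"--author={human} <{human_email}>", "-m", message]


msg = commit_message("develop_parser: attempt 2", "Claude Sonnet", "model@example.invalid")
argv = commit_argv("A. Operator", "operator@example.invalid", msg)
```

A submission made by a human alone would call `commit_message` without a model and carry no trailer.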

Because git is distributed, it also serves as the cross-machine transport. In a team deployment the coordinator carries state while each member’s results move by ordinary git push and fetch from their own machine, so a standard git remote is the synchronization layer and there is no central content store to provision. A remote is needed only for cross-machine work: a single-machine project runs entirely against its local clone, with no origin at all. Access is gated in two layers: the bearer tokens the coordinator already checks, and, because produced content lives in a git remote, the host’s repository permissions that govern who may push it, the latter inherited rather than built.

Git also gives each run reproducible isolation. A run forks its branch from a base commit and is pinned to it, so later edits to the live workflow do not affect an in-flight run; and because every submission, including each revision attempt, is a separate commit, the record of how a result was reached remains in the branch. On completion the branch is reconciled to the base under a configurable policy (none, merge, or push), and a failed run is left untouched for inspection.

For audit, the git history is in itself a durable record of what was produced and by whom, and can be protected against rewriting by ordinary means such as protected branches or signed commits. As a secondary check, Enju also records in its event log the commit SHA observed at each submission; comparing the two reveals any history that was altered after the fact. This pairing is appropriate for a research team running its own coordinator and extends to stronger guarantees without architectural change.
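The secondary check reduces to a set comparison between the SHAs the event log recorded and the commits still reachable from the run branch. A minimal sketch follows, assuming the recorded SHAs are available as (task, sha) pairs; the function names are hypothetical, and only the standard `git rev-list` command is relied on.

```python
import subprocess


def branch_shas(repo: str, branch: str) -> set:
    """Commits reachable from a branch, via the standard `git rev-list`."""
    out = subprocess.run(
        ["git", "-C", repo, "rev-list", branch],
        check=True, capture_output=True, text=True,
    ).stdout
    return set(out.split())


def altered_submissions(recorded, reachable):
    """Recorded (task, sha) pairs whose commit is no longer in the history."""
    return [(task, sha) for task, sha in recorded if sha not in reachable]


# Worked example with made-up SHAs: the second submission was rewritten away.
recorded = [("develop_parser", "a1b2c3"), ("develop_glue", "d4e5f6")]
missing = altered_submissions(recorded, reachable={"a1b2c3", "0f0f0f"})
```

In practice `reachable` would come from `branch_shas(repo, run_branch)`; an empty result means every recorded submission is still present in the history.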

3 Use Cases

We present three use cases that exercise Enju on tasks of different shape: a software build from a specification, a multi-machine scientific data pipeline, and a systematic-review screen. Chosen to span domains, they stress different parts of the design (Section 2): the output-neutral coordinator, git as the attribution and transport layer, multiple citizens working one shared DAG, and human review as an ordinary task action. These are demonstrations, not benchmarks: each is a single run, or a small number of runs, so the figures below characterise what the mechanism did on these inputs, not a distribution over trials. The headline for each case is the behaviour of the coordinator and the review gate, not the quality of any model’s output. All runs used Claude models (Sonnet- and Haiku-class), with the model bound to each citizen stated per case; full run branches, per-task commits, and event logs are available in each case’s public repository (Section 5).

3.1 Use Case 1: Mustache Template Engine, Built From a Specification

Motivation. This case puts Enju to work on a substantial software build and, through it, shows three things the design is meant to support. The first is decomposition: a complex deliverable is broken into well-scoped sub-tasks, each small enough that a model can do it reliably, with the dependencies between them expressed as the DAG. The second is an explicit development-iteration loop: every build step is gated by a review, and where the reviewer finds a deficiency the step is revised on that feedback and re-reviewed before any dependent work proceeds. The third is multi-citizen collaboration: several agents, with distinct roles and access, claim work from one shared graph, and every result is a git commit attributed to a human and the model that produced it. The workload is chosen so that correctness is objectively checkable, which lets us verify not only that the process ran but that it produced a sound artifact.

The workload is a Mustache templating engine in Python, built only from the official specification text and its JSON conformance corpus, under an explicit no-peeking rule forbidding reference to existing implementations. Enju decomposed the build into a 12-task DAG along the engine’s natural module boundaries (context, tokenizer, parser, renderer, and a glue layer): each module is produced by a develop task and gated by a matching review task, and a downstream develop task depends on the upstream review, not on the upstream code, so each module is built against reviewed, approved input rather than whatever the previous step happened to emit. Six distinct Sonnet-class citizens worked the graph: two developers, two reviewers (given read-only tools, so a reviewer can gate a step but never author its fix), a tester, and a narrator (Figure 3).


Figure 3: The Mustache build DAG, with task nodes coloured by the citizen that ran them (legend below). The build was executed by autonomous Claude Sonnet agent daemons; the human operator is the accountable git author on every commit. The develop tasks are split across two developer agents (blue, violet) and the review gates across two reviewer agents (the two diamond shades); the independent foundation modules context and tokenizer were claimed and worked in parallel, one developer each. Every later module depends on the review of its predecessor, not on its code, so it builds against approved input. Dashed arcs mark the two gates that returned request_changes (renderer, glue), each triggering a revise/re-review cycle before its dependents ran.

The two foundation modules (context, tokenizer) have no dependencies and were claimed and worked in parallel by the two developer citizens; the coordinator serialised the remaining stages along the declared edges. The development-iteration loop then ran as designed. Of the five review gates, three (review_context, review_tokenizer, review_parser) approved on the first pass. Two did not: review_renderer and review_glue, each handled by a different reviewer citizen, returned request_changes with specific feedback on the corresponding develop task. In each case the coordinator rolled that task back to a second attempt and held the tasks that transitively depend on it in PENDING; the developer revised against the feedback, the reviewer re-reviewed and approved the second attempt, and only then were the downstream tasks released. Each revise/re-review/approve cycle completed before any dependent task ran, and every round is visible in git as a distinct attempt on the run branch.

The build completed all twelve tasks, producing a working engine of about 517 lines across the five modules. To confirm the gated process yielded a sound artifact and not merely a process that terminated, we re-tested the engine with an independent harness that loads the official corpus directly and calls the produced render(): it passed 136 of 136 in-scope conformance tests, across the comments, delimiters, interpolation, inverted-section, partials, and section categories. The independent result matched the tester citizen’s own report, corroborating that the report was not confabulated. The decomposition and review loop thus produced a spec-conformant engine.
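An independent harness of this kind can be very small. The sketch below is not the harness used for the reported figures: it assumes the corpus layout of the Mustache specification (a JSON object with a tests list whose entries carry name, data, template, expected, and optionally partials) and a render(template, data, partials) signature, which is a guess at the produced engine's interface. The two sample cases are hand-written in the corpus's format, and toy_render is a deliberately incomplete stand-in so that the example runs on its own.

```python
import json
import re


def run_corpus(spec_json: str, render):
    """Run every test in one spec file; returns (passed, names_of_failures)."""
    tests = json.loads(spec_json)["tests"]
    failures = [
        t["name"] for t in tests
        if render(t["template"], t["data"], t.get("partials", {})) != t["expected"]
    ]
    return len(tests) - len(failures), failures


def toy_render(template, data, partials):
    """Stand-in renderer: handles plain {{name}} interpolation and nothing else."""
    return re.sub(
        r"\{\{\s*(\w+)\s*\}\}",
        lambda m: str(data.get(m.group(1), "")),
        template,
    )


SAMPLE = json.dumps({"tests": [
    {"name": "sample-interpolation", "data": {"subject": "world"},
     "template": "Hello, {{subject}}!\n", "expected": "Hello, world!\n"},
    {"name": "sample-section", "data": {"flag": False},
     "template": "{{#flag}}hidden{{/flag}}", "expected": ""},
]})

passed, failures = run_corpus(SAMPLE, toy_render)
```

Because the harness takes the renderer as a parameter and reads the corpus itself, its verdict does not depend on anything the tester citizen reported.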

The engine is spec-core-complete, not production-hardened: the three optional Mustache modules (lambdas, inheritance, dynamic names) were out of scope by design, so lambdas are unsupported; there was no fuzzing, performance, or adversarial testing beyond the official corpus; and the operator deep-read only the renderer and glue, while the larger parser and tokenizer were checked against the corpus but not line-audited. The two review-gate corrections are organic events in a single run, not a measured rate.

3.2 Use Case 2: Nanopore Phage-Genome Assembly Across Two Machines

Motivation. Bioinformatics pipelines, a chain of quality-control, assembly, and annotation tools wrapped in containers, are a staple of computational biology and are usually run with workflow managers such as Snakemake or Nextflow. This case shows such a pipeline expressed and run in Enju, with no language model in the loop, and through it three things: the output-neutral coordinator driving opaque containerised stages it never inspects; per-sample fan-out with a collecting fan-in; and git as both the attribution record and the only cross-machine transport. The point is that Enju expresses a conventional deterministic pipeline as readily as it does AI work, and adds git-native provenance and cross-machine movement on top.

The workload is an Oxford Nanopore phage-genome assembly pipeline backed by a public GitHub repository: 13 containerised compute stages spanning read QC (NanoPlot, Filtlong), assembly and polishing (Flye, Medaka), assembly QC (seqkit, QUAST, CheckV), phage annotation (Pharokka), and reporting (MultiQC); the DAG is shown in Figure 4. The run fans out over samples, one pipeline instance per sample, and the report stages collect across them. On a server, enju go executed the DAG; the workflow declared sync: push, so on completion the run branch was merged and the default branch pushed to the configured GitHub remote. A second machine then obtained the full result by an ordinary git pull: no Enju-specific transport, no shared filesystem, no content passing through the coordinator. Each stage ran as a container the coordinator never inspected, and every artifact is a git commit authored by the human operator, with the provenance trail on the run branch and the clean deliverable on the default branch.


Figure 4: The Nanopore phage-assembly pipeline DAG, coloured by stage. Every node is a containerised compute task. The run fans out over samples (every stage up to Pharokka runs once per sample) and the report stages collect across samples. The coordinator never inspects the containers; on completion the run branch was merged and pushed, and a second machine reproduced the full result by an ordinary git pull.

The pipeline ran end-to-end across two machines with git as the only transport and no bespoke synchronisation code, and two independent clean projects reproduced the headline result byte-for-byte. As an illustrative pipeline artifact, not a biological finding, sample barcode09 assembled into a single contig of 58,831 bp at 49.0% GC, rated high-quality and 96.96% complete by the CheckV quality-assessment stage with no detected contamination.

The biology is incidental: this is one sample of one organism, and the result demonstrates the coordination and git-native transport model, not a biological finding. The two reproductions establish determinism of this pipeline on this input, not a general reproducibility guarantee.

3.3 Use Case 3: PRISMA Systematic-Review Screen

Motivation. This case primarily exercises human review as an ordinary task action in a setting where an AI step can fail silently. Secondarily, it exercises the output-neutral coordinator (deterministic compute stages it does not inspect) and several agent citizens working with a human on one shared DAG. Literature screening stresses the review gate well: a screening model can confabulate, for instance by emitting decisions it did not derive from the inputs, and such a failure is invisible without an explicit check. The question is whether the review action functions as genuine quality control across a multi-stage pipeline, both catching such failures and approving sound work rather than rubber-stamping it.

The workload is a PRISMA 2020 systematic-review screen of randomised controlled trials of faecal microbiota transplantation (FMT) for recurrent Clostridioides difficile infection, over PubMed for 2013–2024. The run was a 12-task DAG that interleaves deterministic compute with agent work: containerised compute stages (PubMed search, deduplication, PRISMA counting, full-text retrieval, and SVG diagram generation, each a Python script in a python:3.12-slim container) feed four Sonnet-class agent citizens (a screener, an uncertainty resolver, a data extractor, and a synthesiser), with two human review gates: one after abstract screening and one after the resolution of uncertain records (Figure 5).

Figure 5: The PRISMA review DAG (run 6), coloured by executor: deterministic compute stages (teal), Sonnet agent tasks (blue), and human review gates (orange diamonds) on one graph. The screener’s output passes a human gate before full-text retrieval; an uncertainty resolver’s output passes a second human gate; an extractor and a synthesiser then produce the final data and write-up. This is the case where humans, AI agents, and deterministic compute act as peers on a single shared DAG.

On the full run, the pipeline carried 220 identified records to 219 retrieved and 217 screened after deduplication. Abstract screening excluded 201, included 13, and marked 3 uncertain; full-text resolution of the uncertain records then yielded a final set of 14 included RCTs. The synthesiser produced a publication-style synthesis.md with a 14-row trial table whose reported PMIDs and outcome figures were spot-checked against the published trials. Both human gates were exercised.
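The reported counts can be checked for internal consistency with a few lines of arithmetic. The sketch below uses the figures stated above; the dictionary keys and the function check_flow are hypothetical names, and the last check assumes that only the uncertain records changed status at full-text resolution.

```python
# Counts as reported for the full run.
counts = {
    "identified": 220,
    "retrieved": 219,
    "screened": 217,
    "excluded": 201,
    "included_at_abstract": 13,
    "uncertain": 3,
    "final_included": 14,
}


def check_flow(c):
    """Return a list of consistency problems; an empty list means none found."""
    problems = []
    if not c["identified"] >= c["retrieved"] >= c["screened"]:
        problems.append("record counts must not increase along the flow")
    if c["excluded"] + c["included_at_abstract"] + c["uncertain"] != c["screened"]:
        problems.append("abstract-screening decisions do not sum to screened records")
    # Assumption: the final set is the abstract-stage includes plus
    # whichever uncertain records were included at full text.
    promoted = c["final_included"] - c["included_at_abstract"]
    if not 0 <= promoted <= c["uncertain"]:
        problems.append("final included set is inconsistent with the uncertain pool")
    return problems


print(check_flow(counts))  # prints []
```

Under that assumption, the counts imply that one of the three uncertain records was included after full-text resolution (14 - 13 = 1).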

The review gate appeared in three distinct roles across runs. In a catch instance, an earlier screening citizen backed by a smaller (Haiku-class) model emitted 170 decisions for 99 input records, echoing prior numbers rather than screening the inputs; the human gate caught the decision-count mismatch, returned request_changes, and the citizen re-ran to produce a genuine 99-record screen. In a validate instance, a Sonnet screener on a contained slice produced 120 decisions for 120 inputs with no anomaly, and the gate reviewed and approved it. In the full-pipeline instance above, the screening passed review and the run proceeded through resolution, extraction, and synthesis. The review gate acted as quality control rather than a formality: it caught a confabulating screen and forced a corrected pass in one case, and reviewed and approved sound screens in others.
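The mismatch caught in the first instance is the kind of check that is easy to state precisely. The following sketch shows one way a reviewer could test a screen's output against its inputs; it is an illustration, not part of Enju, and the record layout (a record_id field), the function review_screen, and the "approve" verdict string are assumptions. Only request_changes is taken from the text.

```python
from collections import Counter


def review_screen(input_ids, decisions):
    """Compare screening decisions against input records.

    Returns (verdict, reasons), where verdict is "approve" or "request_changes".
    """
    reasons = []
    decided = [d["record_id"] for d in decisions]
    if len(decided) != len(input_ids):
        reasons.append(f"{len(decided)} decisions for {len(input_ids)} input records")
    duplicates = sorted(r for r, n in Counter(decided).items() if n > 1)
    if duplicates:
        reasons.append(f"{len(duplicates)} records decided more than once")
    unknown = sorted(set(decided) - set(input_ids))
    if unknown:
        reasons.append(f"{len(unknown)} decisions refer to records not in the input")
    missing = sorted(set(input_ids) - set(decided))
    if missing:
        reasons.append(f"{len(missing)} input records have no decision")
    return ("request_changes" if reasons else "approve"), reasons
```

A check like this only establishes that the decisions line up with the inputs; whether each decision is correct still requires the reviewer's judgement.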

This is a systems demonstration, not a citable systematic review. Each of the three instances is a single run on one topic, with one screener, one resolver, and one extractor, so there is no inter-rater agreement and no measured error rate; only PubMed was searched; and the synthesis is a working draft, internally consistent with the extracted data but not editorially reviewed for publication. The contribution evidenced is the review action operating as a quality gate over AI work, not the screening accuracy or the synthesis as a finished scientific product.

4 Discussion

Enju is a workflow system in the established DAG-of-tasks tradition, with additions aimed at multi-actor work on a shared graph. A coordinator lets humans, autonomous AI agents, and deterministic compute tasks claim and work the same DAG as peers, with agents part of the system rather than external callers, and with human review and voting expressed as ordinary task actions rather than out-of-band approvals. The coordinator carries task state while git holds everything produced, so one graph can mix all three kinds of work without a content store, an identity system, or a separate cross-machine transport, and provenance follows from the ordinary commit path. In this it treats the problem as one of coordination rather than execution [14], with human review and provenance falling out of ordinary use rather than added as separate machinery.

The three workflows in Section 3 demonstrate these capabilities rather than benchmark them, each exercising a different part of the design. The phage-assembly pipeline ran as opaque containerised stages the coordinator never inspected, its results crossing to a second machine through a git remote alone. The software build and the systematic-review screen both turned on review as a task action: agent review gates returned request_changes and held dependents until a revision was approved, and a human gate caught a model that had confabulated a decision count and forced a corrected pass. Across all three, the coordination structure stayed correct and useful even where a model did not, which is the cross-cutting theme of the paper. These are demonstrations, not production-grade benchmarks: each is a single run, without repetition or controlled comparison.
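The behaviour in which a request_changes verdict holds dependents until a revision is approved can be sketched as a small state machine. This is a simplified illustration, not Enju's data model: the class ReviewGraph, the state names, and the "approve" verdict string are hypothetical, and the sketch folds the review gate into the reviewed task's state for brevity.

```python
class ReviewGraph:
    """Tasks become ready only when every dependency has been approved."""

    def __init__(self, depends_on):
        # depends_on maps each task name to the list of tasks it waits for.
        self.depends_on = depends_on
        self.state = {name: "pending" for name in depends_on}

    def ready(self):
        return sorted(
            name
            for name, deps in self.depends_on.items()
            if self.state[name] == "pending"
            and all(self.state[d] == "approved" for d in deps)
        )

    def submit(self, name):
        if name not in self.ready():
            raise ValueError(f"{name} is not ready to be worked")
        self.state[name] = "in_review"

    def review(self, name, verdict):
        if self.state[name] != "in_review":
            raise ValueError(f"{name} has no submission to review")
        if verdict == "approve":
            self.state[name] = "approved"
        elif verdict == "request_changes":
            # Back to pending: the task must be redone, dependents stay held.
            self.state[name] = "pending"
        else:
            raise ValueError(f"unknown verdict: {verdict}")


graph = ReviewGraph({"screen": [], "retrieve_full_text": ["screen"]})
graph.submit("screen")
graph.review("screen", "request_changes")
print(graph.ready())  # prints ['screen']
graph.submit("screen")
graph.review("screen", "approve")
print(graph.ready())  # prints ['retrieve_full_text']
```

The dependent task never appears in the ready set until the upstream work is approved, which is the property the two review-driven use cases relied on.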

Access control is coarse-grained: coordinator tokens gate who may act and the git host’s permissions gate the produced content, but a git repository grants read access as a whole, not per file or per citizen. Confidential inputs can stay out of git, either by declaring them untracked (track: false) and referencing them in place, or by running on a single machine with no remote. Per-citizen access currently means partitioning by repository; finer-grained access within one repository is future work.

4.1 Future work

Several directions would extend both the evidence and the system. On the evidence, the most important is multi-citizen work at scale: workflows with many independent humans, each running their own model, contending for tasks on one shared graph, to test whether the coordination properties hold beyond the small parallel runs shown here. Additional workflows in computational biology and adjacent domains would exercise the system against more demanding pipelines and show where its guarantees hold or need extension. Scaling in this direction would also put the coordinator’s single write path under sustained concurrent load; that path is covered today by the suite’s concurrency tests and by the parallel runs reported here, and measuring it at scale would indicate when sharding or replicating the coordinator becomes worthwhile.
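The property at stake under contention is that each task is claimed by exactly one citizen. The toy sketch below serialises claims through one lock to stand in for a single write path; it is not Enju's coordinator, and the class Coordinator, the function run_contention, and the task and citizen names are all hypothetical.

```python
import threading


class Coordinator:
    """Toy coordinator: every claim goes through one lock (a single write path)."""

    def __init__(self, task_ids):
        self._lock = threading.Lock()
        self._owner = {task_id: None for task_id in task_ids}

    def claim(self, citizen):
        """Atomically assign the first unclaimed task to citizen, or return None."""
        with self._lock:
            for task_id, owner in self._owner.items():
                if owner is None:
                    self._owner[task_id] = citizen
                    return task_id
        return None


def run_contention(n_tasks=200, n_citizens=8):
    coordinator = Coordinator([f"task-{i}" for i in range(n_tasks)])
    claimed = {f"citizen-{i}": [] for i in range(n_citizens)}

    def work(citizen):
        # Keep claiming until the coordinator has nothing left.
        while (task_id := coordinator.claim(citizen)) is not None:
            claimed[citizen].append(task_id)

    threads = [threading.Thread(target=work, args=(c,)) for c in claimed]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return claimed
```

Counting the claims across all citizens after run_contention returns gives a direct check that no task was claimed twice or left unclaimed. A measurement at scale would additionally record how long claims wait on the serialised path.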

On the system itself, the coordinator can be made leaner. It currently holds task state together with prompts and task definitions; moving those into the git-resident workflow snapshot would leave the coordinator to track only the DAG and its decisions, narrowing it toward a pure state machine and reducing the user-supplied data it holds. Compute tasks, today run locally or in Docker and Apptainer containers, could also be dispatched to HPC batch schedulers such as SLURM, so that long or large stages run on shared cluster infrastructure while keeping the same task contract and git-native result path.

5 Availability

Enju is open-source software, released under the MIT license; the source code and documentation are available in the project repository. Each use case in Section 3 has its own public repository containing the workflow definition and the git history of the run, which records every produced artifact and review decision, so the reported runs are reproducible from the version of Enju in the project repository. This paper was drafted with AI assistance; all technical claims and the system itself were authored and verified by the human author.

References

[1] N. F. Liu, K. Lin, J. Hewitt, A. Paranjape, M. Bevilacqua, F. Petroni, and P. Liang. “Lost in the Middle: How Language Models Use Long Contexts”. In: Transactions of the Association for Computational Linguistics 12 (2024), pp. 157–173. doi: 10.1162/tacl_a_00638.

[2] J. Köster and S. Rahmann. “Snakemake—a scalable bioinformatics workflow engine”. In: Bioinformatics 28.19 (2012), pp. 2520–2522. doi: 10.1093/bioinformatics/bts480.

[3] P. Di Tommaso, M. Chatzou, E. W. Floden, P. Prieto Barja, E. Palumbo, and C. Notredame. “Nextflow enables reproducible computational workflows”. In: Nature Biotechnology 35.4 (2017), pp. 316–319. doi: 10.1038/nbt.3820.

[4] M. R. Crusoe, S. Abeln, A. Iosup, P. Amstutz, J. Chilton, N. Tijanić, H. Ménager, S. Soiland-Reyes, B. Gavrilović, and C. Goble. “Methods Included: Standardizing Computational Reuse and Portability with the Common Workflow Language”. In: Communications of the ACM 65.6 (2022), pp. 54–63. doi: 10.1145/3486897.

[5] The Galaxy Community. “The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2022 update”. In: Nucleic Acids Research 50.W1 (2022), W345–W351. doi: 10.1093/nar/gkac247.

[6] E. Deelman et al. “Pegasus, a workflow management system for science automation”. In: Future Generation Computer Systems 46 (2015), pp. 17–35. doi: 10.1016/j.future.2014.10.008.

[7] Apache Software Foundation. Apache Airflow. https://airflow.apache.org. 2026.

[8] Dagster Labs. Dagster. https://dagster.io. 2026.

[9] Prefect Technologies, Inc. Prefect. https://www.prefect.io. 2026.

[10] Temporal Technologies, Inc. Temporal. https://temporal.io. 2026.

[11] LangChain, Inc. LangGraph. https://www.langchain.com/langgraph. 2026.

[12] CrewAI, Inc. CrewAI. https://www.crewai.com. 2026.

[13] Q. Wu, G. Bansal, J. Zhang, Y. Wu, S. Zhang, E. Zhu, B. Li, L. Jiang, X. Zhang, and C. Wang. AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation. 2023. doi: 10.48550/arXiv.2308.08155.

[14] T. W. Malone and K. Crowston. “The interdisciplinary study of coordination”. In: ACM Computing Surveys 26.1 (1994), pp. 87–119. doi: 10.1145/174666.174668.