Technical · Dispatch № 02

Shipping Code AFK: A Docker Sandbox and Ralph Loop for Laravel and Claude Code

A step-by-step guide to running Claude Code unattended on a Laravel project — building the sandbox image, wiring up a self-driving loop, and learning what breaks the hard way.

If you have a backlog you'll never finish, stop working tickets by hand. Most of us use coding agents the wrong way. We sit there. We approve tool calls. We babysit a 20-minute task and call that leverage. It isn't. It's a slightly faster version of doing the work yourself, with the agent as a very expensive pair of hands that can't be trusted alone in the room.

The shift I'm after is different. Claude picks an issue off the backlog while I'm doing something else. Builds it. Runs the tests. Commits. Closes the issue. Moves to the next one. I come back to a queue that's shorter than I left it.

That shift has a name now. "Ralph loop" was coined by Geoffrey Huntley in Ralph Wiggum as a 'software engineer' (July 2025). In its purest form it's literally while :; do cat PROMPT.md | claude-code ; done — a dumb, persistent while true that pipes the same prompt into a coding agent until the work is done. The joke is that Ralph Wiggum is the agent: lovable, not very bright, relentlessly persistent, and every time he falls off the slide you add another sign to the prompt ("SLIDE DOWN, DON'T JUMP"). It's funny. It's also the cheapest known path from "AI as autocomplete" to "AI as a coworker with a ticket queue."

Two pieces make it safe. A Docker sandbox, so Claude can't touch your host, your database, or anything on your LAN. And a Ralph loop, so it keeps going until the queue is empty. Miss either piece and you get the thing everyone's afraid of: an agent with rm -rf energy and no walls.

This post walks through the whole setup on a Laravel project. Herd on the host, Postgres inside the sandbox, Pest 4 with ParaTest, Playwright for browser tests. Most of the shape of the workflow comes from Matt Pocock's Claude Code for Real Engineers cohort — the Ralph loop pattern, sandboxed autonomous agents, and the feedback-loop mindset are all from there. Huge thanks to Matt. What this post adds is the Docker + Laravel + Herd + Postgres + Pest 4 substrate, and the specific things that broke along the way.

Why not use Sandcastle?

A fair question. Matt also maintains Sandcastle, a TypeScript library that productizes exactly this pattern — sandbox provider, iteration loop, completion signal, branch strategy, the lot. It's a great piece of software and if you want the Ralph loop as an API call, it's the obvious choice.

I didn't use it for two reasons:

  1. I'm cautious about third-party tooling on top of my Claude subscription. Sandcastle invokes the official claude CLI inside a container using my own auth, so there's no obvious ToS issue with the library itself. But unattended long-running loops are already an unusual usage pattern, and I'd rather not add an extra layer that could look, from Anthropic's side, like something other than me driving the CLI. The risk is probably small. It's not zero, and I'd rather not find out.
  2. I want full control. Every piece of this setup — the Dockerfile, setup.sh, the jq filter in afk.sh, the prompt — changed multiple times as I hit real problems. Owning those files outright meant the fix was always "edit this script"; with an orchestration layer in the middle, some fixes would've meant "patch the library or route around it." For a personal workflow I run every day, the DIY version is genuinely less friction.

Neither reason is an argument against Sandcastle. Just an explanation of why this post describes a handful of shell scripts instead of an import.

The shape of unattended coding

Before any Dockerfile, there's a mental model worth getting straight. Skip it and the tactical steps read like a cargo cult.

Three ideas do all the work.

The sandbox is what unlocks the agent. The reason coding agents feel like a ceremony is --permission-mode bypassPermissions. Nobody wants to grant it on their host, and they're right not to. But put the agent inside a container that can't reach your host, your LAN, or anything past an HTTP proxy, and bypassing permissions stops being reckless. It becomes the only sane default. The walls are what let the agent move fast.

That's the missing piece. Not the loop. Not the prompt. The boundary.

The backlog is the interface. A Ralph loop needs somewhere to pull tasks from. You could use a TODO file. Don't. Use your repo's GitHub issues, tag them afk or hitl, and let the agent pull from the same queue you do. Completion is closing the issue. Handoff is a comment. Humans and agents share the board, no reconciliation, no special state.

Feedback compounds. Every iteration, the agent writes to a feedback.md file with what was slow, what was confusing, what broke. You read it the next morning and patch the Dockerfile, the prompt, or the setup script. The loop gets better week over week not because you sit down to redesign it, but because the agent keeps telling you where it hurts.

Hold those three ideas in your head. Everything below is plumbing in service of them.

The workflow at a glance

  1. I triage the backlog into GitHub issues and tag each one either AFK (safe for the loop to pick up) or HITL (human in the loop: design calls, risky refactors, anything I want to drive myself).
  2. I open a sandbox with sbx run shell and run ./ralph/afk.sh N inside it.
  3. Each iteration, afk.sh feeds Claude the last 5 commits + open GitHub issues + ralph/prompt.md and lets it pick one AFK issue, build it end-to-end, run the scoped feedback loops, commit, close the issue, and log friction to ralph/feedback.md.
  4. When there are no AFK issues left, Claude emits <promise>NO MORE TASKS</promise> and the loop exits.

Claude runs with --permission-mode bypassPermissions inside the sandbox. No tool-call prompts. No babysitting. The network proxy is the only thing standing between the agent and the outside world — and that's enough.

The ralph/ file layout

File         Purpose
Dockerfile   Builds the sandbox template (PHP, Postgres, pnpm, Playwright).
setup.sh     Runs once per session at the top of afk.sh: boots Postgres, fixes perms, ensures a UTF8 template1, skips pnpm install when node_modules is valid.
afk.sh       The N-iteration loop itself (stream-json + jq filter + stop-token watch).
prompt.md    The playbook Claude reads every iteration.
feedback.md  Append-only log of friction the agent hit; not committed.

Step 1: Pick the right sandbox runner

There are two options, and the choice matters.

I started with docker sandbox — the subcommand shipped with Docker Desktop. It works. It's also capped at 4 GB of memory, and that's not enough.

Claude, Postgres, ParaTest with 10 workers, Vite, Playwright. Something gets killed mid-iteration. Usually Postgres. The Linux out-of-memory killer shoots whichever process is the biggest offender when a container hits its ceiling, and you only find out when the tests fail for reasons that make no sense.
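One way to confirm that diagnosis instead of guessing: on cgroup v2 systems the kernel keeps an oom_kill counter in the container's memory.events file. The sketch below reads it. The helper name and the default path are assumptions, and the file may not be exposed inside every sandbox.

```shell
# Hypothetical helper: print how many times the OOM killer has fired in this
# cgroup. Assumes cgroup v2, where memory.events has a line "oom_kill <count>".
oom_kill_count() {
  local events_file="${1:-/sys/fs/cgroup/memory.events}"
  if [ ! -r "$events_file" ]; then
    echo "cannot read $events_file" >&2
    return 1
  fi
  # Print the counter, or 0 if the line is missing.
  awk '$1 == "oom_kill" { print $2; found = 1 }
       END { if (!found) print 0 }' "$events_file"
}
```

Calling oom_kill_count before and after a test run shows whether the counter moved, which is a cheaper signal than reading test failures that make no sense.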

Docker ships a second CLI called sbx that creates sandboxes without the Desktop memory limit. I switched. The out-of-memory kills went away. Two gotchas come with that switch:

  1. sbx only pulls images from a registry — it doesn't use local Docker images. Building the template locally (docker build -t claude-php-8.4 ralph/) is not enough. You have to push it to Docker Hub (or another registry that sbx can pull from) and reference it by its registry name.
  2. sbx run doesn't print the initial prompt back to stdout the way docker sandbox run claude . -- "..." did. Passing a long prompt as an argument works, but you don't see what Claude received, which made debugging the loop painful. The workaround (run the loop from inside an interactive shell) is in Step 5.

Prereqs from here on out: the sbx CLI and a Docker Hub account to host the template image.

Step 2: Build the custom sandbox image

The base docker/sandbox-templates:claude-code image includes Node.js, Git, GitHub CLI, ripgrep, jq, Go, and Python 3 — but no PHP or PostgreSQL. I extend it with PHP 8.4 (all Laravel Cloud-supported extensions), PostgreSQL 17, Composer, pnpm, and Playwright's browser dependencies.

The Dockerfile lives at ralph/Dockerfile:

FROM docker/sandbox-templates:claude-code
 
USER root
 
RUN apt-get update && apt-get install -y \
    php8.4 \
    php8.4-cli \
    php8.4-common \
    php8.4-apcu \
    php8.4-bcmath \
    php8.4-bz2 \
    php8.4-curl \
    php8.4-dba \
    php8.4-enchant \
    php8.4-excimer \
    php8.4-gd \
    php8.4-gmp \
    php8.4-igbinary \
    php8.4-imagick \
    php8.4-imap \
    php8.4-intl \
    php8.4-ldap \
    php8.4-mbstring \
    php8.4-mongodb \
    php8.4-msgpack \
    php8.4-mysql \
    php8.4-odbc \
    php8.4-opcache \
    php8.4-pcov \
    php8.4-pgsql \
    php8.4-pspell \
    php8.4-readline \
    php8.4-redis \
    php8.4-snmp \
    php8.4-soap \
    php8.4-sqlite3 \
    php8.4-tidy \
    php8.4-xml \
    php8.4-zip \
    postgresql \
    postgresql-client \
    unzip \
    libnss3 \
    libnspr4 \
    libatk1.0-0t64 \
    libatk-bridge2.0-0t64 \
    libatspi2.0-0t64 \
    libcups2t64 \
    libxcomposite1 \
    libxdamage1 \
    libxfixes3 \
    libxrandr2 \
    libgbm1 \
    libpango-1.0-0 \
    libcairo2 \
    libasound2t64 \
    libxshmfence1 \
    libx11-xcb1 \
    libdrm2 \
    libxkbcommon0 \
    fonts-liberation \
    && rm -rf /var/lib/apt/lists/*
 
COPY --from=composer:2.9.5 /usr/bin/composer /usr/bin/composer
 
RUN mkdir -p /home/agent/pgdata /run/postgresql \
    && chown -R agent:agent /home/agent/pgdata /run/postgresql
 
USER agent
 
RUN npm install -g pnpm
 
RUN /usr/lib/postgresql/17/bin/initdb -D /home/agent/pgdata --encoding=UTF8 --locale=C.UTF-8
 
RUN npm install -g playwright@latest
 
RUN npx playwright install

A few things worth calling out:

  • php8.4-pcov so Pest can produce coverage reports inside the sandbox.
  • A long list of Playwright/Chromium shared libraries (libnss3, libatk1.0-0t64, libcups2t64, libgbm1, etc.) required for Pest 4 browser tests.
  • pnpm globally via npm install -g pnpm.
  • playwright globally and its browsers via npx playwright install.
  • A pre-initialized PostgreSQL data directory owned by agent so Postgres can start without root at runtime. I initialize it with --encoding=UTF8 --locale=C.UTF-8 because ParaTest's per-worker databases inherit template1's encoding, and a mismatched encoding blocks parallel test runs (more on this in Step 4).

Build it and push it:

docker build -t guetteman/claude-php-8.4 ralph/
docker push guetteman/claude-php-8.4

The push is what makes the template visible to sbx. Skip it and sbx run --template guetteman/claude-php-8.4 ... will fail to pull. Now the image exists, but a running container is not a running environment. Postgres still needs to boot, permissions still need fixing, and a stray ANTHROPIC_API_KEY still needs to get unset before it costs you real money.

Step 3: Decide where Postgres runs (spoiler: inside the sandbox)

Your first instinct will be to point the sandbox at Herd's Postgres on the host. It won't work.

Docker sandboxes, both docker sandbox and sbx, only allow HTTP/HTTPS outbound traffic. Raw TCP is blocked at the network layer. That means Postgres (5432), MySQL (3306), Redis, anything on the LAN is unreachable. A regular docker run can punch through with --network host or --add-host. Sandboxes strip those flags out on purpose. The walls are the feature.
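If you want to see the wall for yourself, a raw TCP probe from inside the sandbox is enough. This is a minimal sketch, assuming bash (for its /dev/tcp pseudo-device) and the coreutils timeout command are available; the function name is made up for illustration.

```shell
# Hypothetical probe: succeed only if a raw TCP connection opens within 2s.
# Relies on bash's /dev/tcp redirection, so it will not work under plain sh.
can_connect() {
  local host="$1" port="$2"
  # Inside the child shell, $0 is the host and $1 is the port.
  timeout 2 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}
```

Usage would look like `can_connect 192.168.1.10 5432 || echo "blocked, as expected"`, where the address stands in for wherever your host's Postgres listens.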

So Postgres moves inside. Your .env gets these credentials.

DB_CONNECTION=pgsql
DB_HOST=127.0.0.1
DB_PORT=5432
DB_DATABASE=resumeskit
DB_USERNAME=root
DB_PASSWORD=

No password is needed: initdb was run without an --auth option, so PostgreSQL falls back to trust authentication for local connections. Running Postgres inside the sandbox also means every new session has to boot it, fix the ownership of its socket directory, and make sure template1 is UTF8 before ParaTest clones from it. That's what setup.sh is for.

Step 4: Write setup.sh — what the sandbox needs before Claude starts

ralph/setup.sh runs once at the top of afk.sh, before the iteration loop begins. It handles everything that can't live in the image because it depends on mounted project state, on the sandbox's runtime user IDs, or on tmpfs state that gets wiped on every container boot:

#!/bin/bash
set -eo pipefail
 
if ! pnpm exec vite --version >/dev/null 2>&1; then
  make install-pnpm
fi
 
# /var/run is tmpfs; the dir the Dockerfile created is gone on every boot.
sudo mkdir -p /var/run/postgresql
sudo chown agent:agent /var/run/postgresql
 
PG=/usr/lib/postgresql/17/bin
 
if ! "$PG/pg_isready" -h 127.0.0.1 -q; then
  if [ -f /home/agent/pgdata/postmaster.pid ] && ! pgrep -F /home/agent/pgdata/postmaster.pid >/dev/null 2>&1; then
    rm -f /home/agent/pgdata/postmaster.pid
  fi
  : > /home/agent/pgdata/logfile
  "$PG/pg_ctl" -D /home/agent/pgdata -l /home/agent/pgdata/logfile -w start || {
    echo "postgres failed to start; logfile:"
    cat /home/agent/pgdata/logfile
    exit 1
  }
fi
 
"$PG/psql" -h 127.0.0.1 -d postgres -c 'CREATE ROLE root WITH LOGIN SUPERUSER;' 2>/dev/null || true
 
if ! "$PG/psql" -h 127.0.0.1 -d postgres -tAc \
  "SELECT pg_encoding_to_char(encoding) FROM pg_database WHERE datname='template1'" \
  | grep -q '^UTF8$'; then
  "$PG/psql" -h 127.0.0.1 -d postgres <<'SQL'
UPDATE pg_database SET datistemplate=FALSE WHERE datname='template1';
DROP DATABASE template1;
CREATE DATABASE template1 WITH TEMPLATE = template0 ENCODING = 'UTF8';
UPDATE pg_database SET datistemplate=TRUE WHERE datname='template1';
SQL
fi
 
"$PG/createdb" -h 127.0.0.1 resumeskit 2>/dev/null || true
"$PG/createdb" -h 127.0.0.1 test 2>/dev/null || true

All steps are idempotent — re-running setup.sh on an already-booted sandbox is a no-op. The early draft of this script was shorter and more defensive (lots of 2>/dev/null || true), and that turned out to be a mistake. The current shape — explicit state checks, loud failures — only exists because I hoisted setup out of the loop and watched five latent bugs fall out in sequence. That's Step 7.
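The role and database creation lines above still lean on 2>/dev/null || true. The same "check state, then act" style could be applied there too. The following is only a sketch: ensure_role is a hypothetical helper, and it assumes psql is on PATH rather than called through "$PG/psql".

```shell
# Hypothetical helper: create a superuser role only if it is missing,
# and let any real psql failure surface instead of swallowing it.
ensure_role() {
  local role="$1"
  local exists
  # -tAc prints just the value: "1" if the role exists, nothing otherwise.
  exists=$(psql -h 127.0.0.1 -d postgres -tAc \
    "SELECT 1 FROM pg_roles WHERE rolname = '$role'") || return 1
  if [ "$exists" != "1" ]; then
    psql -h 127.0.0.1 -d postgres -c "CREATE ROLE \"$role\" WITH LOGIN SUPERUSER;"
  fi
}
```

With this shape, "role already exists" is a checked state rather than an ignored error, so a connection failure is no longer indistinguishable from success.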

ParaTest and template1 encoding

When I enabled parallel tests I hit this from 9 of 10 ParaTest workers:

new encoding (UTF8) is incompatible with the encoding of the
template database (SQL_ASCII).

ParaTest creates per-worker databases by cloning template1. If template1 isn't UTF8, every CREATE DATABASE ... ENCODING 'UTF8' call fails. I fixed this in two places:

  1. In the Dockerfile, initdb runs with --encoding=UTF8 --locale=C.UTF-8, so new images ship a UTF8 template1 from the start.
  2. setup.sh re-encodes template1 idempotently for sandboxes built against an older image. The script drops and recreates template1 from template0 with ENCODING = 'UTF8' only if it isn't already UTF8.

A related fix at the Laravel layer: phpunit.xml used to override .env.testing with SQLite :memory:, which silently defeated the Postgres setup. I removed the override so .env.testing's PostgreSQL credentials actually apply during tests.
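A cheap guard against that override creeping back in is to grep phpunit.xml for a pinned DB_CONNECTION. This is a rough sketch, not part of the original setup: the function name is hypothetical, and it only skips single-line XML comments.

```shell
# Hypothetical guard: succeed if phpunit.xml still pins DB_CONNECTION,
# which would override whatever .env.testing says.
phpunit_overrides_db() {
  local config="${1:-phpunit.xml}"
  # Drop single-line comments first so a commented-out entry is not a match.
  grep -v '<!--' "$config" \
    | grep -Eq '<(env|server)[[:space:]]+name="DB_CONNECTION"'
}
```

Something like `phpunit_overrides_db && echo "phpunit.xml overrides the test database"` could run at the end of setup.sh so the mismatch is reported before any test does.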

Unset ANTHROPIC_API_KEY in afk.sh, not setup.sh

afk.sh clears the API key once, before the loop:

[ -n "${ANTHROPIC_API_KEY:-}" ] && unset ANTHROPIC_API_KEY

The ${ANTHROPIC_API_KEY:-} default keeps the test safe even under set -u, where referencing an absent variable would otherwise be an error. Without the unset, if the host's ANTHROPIC_API_KEY leaks through into the sandbox environment, the CLI silently picks it up and bills API credits instead of using your Claude subscription.

The unset has to live in afk.sh, not setup.sh. setup.sh is invoked as a child process (./ralph/setup.sh, not source), so an unset inside it only changes the child's own environment. The parent afk.sh still has the key, and every claude invocation it spawns inherits it. Put the line in the file that invokes claude. More on this in Step 7.
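The scoping rule is easy to verify in isolation. The sketch below uses a throwaway variable, DEMO_KEY (a made-up name), and a bash -c child standing in for ./ralph/setup.sh.

```shell
# Demonstrates that unset in a child process does not reach the parent.
demo_unset_scope() {
  export DEMO_KEY=secret
  # The child gets a copy of the environment; its unset dies with it.
  bash -c 'unset DEMO_KEY'
  echo "after child unset: ${DEMO_KEY:-<unset>}"
  # Only an unset in the current shell actually removes the variable.
  unset DEMO_KEY
  echo "after parent unset: ${DEMO_KEY:-<unset>}"
}
```

Running demo_unset_scope prints "after child unset: secret" followed by "after parent unset: <unset>", which is exactly the difference between clearing the key in setup.sh and clearing it in afk.sh.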

Step 5: Run the loop from inside an interactive shell

My automation loop is ralph/afk.sh, which shells out to claude --verbose --print --output-format stream-json ... N times, formats the stream, and stops when Claude emits <promise>NO MORE TASKS</promise>.

Originally afk.sh wrapped the claude call with docker sandbox run claude . -- .... That broke after I moved to sbx: the CLI swallows the prompt text in its output, so the loop ran blind — I couldn't tell what Claude saw.

The fix is to not pass the workload as an argument to sbx at all. Open an interactive shell in the sandbox and run afk.sh from inside it:

# First time only — creates the sandbox from the template:
sbx run shell --template guetteman/claude-php-8.4
 
# Every subsequent session — reuses the existing sandbox:
sbx run shell
 
# Then, inside the sandbox:
./ralph/afk.sh 20

First-time bootstrap inside the shell

The template ships Claude Code and gh, but neither is authenticated. On first entry:

  1. Install/update the Claude CLI if it's missing, then run claude once and complete the interactive login.
  2. Run gh auth login and complete the GitHub device flow — the loop shells out to gh issue list every iteration, so this has to work non-interactively afterwards.
  3. Confirm ANTHROPIC_API_KEY is unset (afk.sh clears it once before the loop starts, but check once by hand).
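Those three checks can be bundled into a small preflight function so a half-bootstrapped sandbox fails before the loop starts. This is a sketch under assumptions: preflight is a hypothetical name, and it relies only on command -v and gh auth status.

```shell
# Hypothetical preflight: report every missing prerequisite, then fail once.
preflight() {
  local failed=0
  command -v claude >/dev/null 2>&1 \
    || { echo "claude CLI not found" >&2; failed=1; }
  gh auth status >/dev/null 2>&1 \
    || { echo "gh is not authenticated" >&2; failed=1; }
  [ -z "${ANTHROPIC_API_KEY:-}" ] \
    || { echo "ANTHROPIC_API_KEY is set" >&2; failed=1; }
  return "$failed"
}
```

It does not prove that the Claude login itself completed, only that the binary exists, so the first interactive claude run is still worth doing by hand.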

Two consequences of running the loop from inside the shell:

  • The script calls claude directly (no docker sandbox run claude . -- prefix). It sets --permission-mode bypassPermissions and --include-partial-messages so I can stream tool-call summaries as they happen instead of waiting for full messages.
  • afk.sh invokes ralph/setup.sh once, before the for loop. The prompt passed to claude no longer mentions setup — pushing that bootstrap into Claude's first tool call on every iteration was burning a full round-trip (parse → schedule → stream → complete) per loop for a script that's idempotent anyway.

The stream is filtered by a jq program that turns each content_block_start / content_block_delta / content_block_stop event into a readable line — so instead of raw JSON you see [Bash] composer install, [Read] app/Models/User.php, etc. That filter lives inline in afk.sh.

Here's the full script:

#!/bin/bash
set -eo pipefail
 
if [ -z "$1" ]; then
  echo "Usage: $0 <iterations>"
  exit 1
fi
 
# jq filter to extract final result
final_result='select(.type == "result").result // empty'
 
# jq filter: stream text and format tool calls with readable summaries
stream_text='foreach inputs as $obj (
  {tool: "", input: "", emit: null};
 
  if $obj.type == "stream_event" then
    if $obj.event.type == "content_block_delta" then
      if $obj.event.delta.type == "text_delta" then
        .emit = $obj.event.delta.text
      elif $obj.event.delta.type == "input_json_delta" then
        .input += ($obj.event.delta.partial_json // "")
        | .emit = null
      else .emit = null end
 
    elif $obj.event.type == "content_block_start"
      and $obj.event.content_block.type == "tool_use" then
        .tool = $obj.event.content_block.name
        | .input = ""
        | .emit = null
 
    elif $obj.event.type == "content_block_stop" then
      if .tool != "" then
        (.input | try fromjson catch {}) as $inp |
        .emit = "\n[\(.tool)] \(
          if   .tool == "Bash"  then ($inp.command // "")
          elif .tool == "Read"  then ($inp.file_path // "")
          elif .tool == "Edit"  then ($inp.file_path // "")
          elif .tool == "Write" then ($inp.file_path // "")
          elif .tool == "Grep"  then ([$inp.pattern, $inp.glob, $inp.path] | map(select(. != null)) | join(" "))
          elif .tool == "Glob"  then ($inp.pattern // "")
          elif .tool == "Agent" then ($inp.description // "")
          else ($inp | to_entries | map(.key + "=" + (.value | tostring)) | join(" "))
          end
        )\n"
      else .emit = null end
      | .tool = ""
      | .input = ""
 
    elif $obj.event.type == "message_stop" then
      .emit = "\n\n"
    else .emit = null end
  else .emit = null end;
 
  .emit // empty
)'
 
[ -n "${ANTHROPIC_API_KEY:-}" ] && unset ANTHROPIC_API_KEY
 
ralph/setup.sh
 
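# One tmpfile for the whole run: tee truncates it each iteration, and a
# single EXIT trap cleans it up (a per-iteration trap would leak earlier files).
tmpfile=$(mktemp)
trap 'rm -f "$tmpfile"' EXIT
 
for ((i=1; i<=$1; i++)); do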
 
  commits=$(git log -n 5 --format="%H%n%ad%n%B---" --date=short 2>/dev/null || echo "No commits found")
  issues=$(gh issue list --state open --json number,title,body,comments)
  prompt=$(cat ralph/prompt.md)
 
  echo ""
  echo "========================================="
  echo " Loop $i / $1"
  echo "========================================="
  echo ""
 
  claude --verbose --print --permission-mode bypassPermissions --output-format stream-json --include-partial-messages \
    "Previous commits: $commits $issues $prompt" \
  | grep --line-buffered '^{' \
  | tee "$tmpfile" \
  | jq -n --unbuffered -rj "$stream_text"
 
  result=$(jq -r "$final_result" "$tmpfile")
 
  if [[ "$result" == *"<promise>NO MORE TASKS</promise>"* ]]; then
    echo "Ralph complete after $i iterations."
    exit 0
  fi
done

A few details worth pointing out:

  • setup.sh runs once, before the loop. Earlier versions injected "Before starting, run ralph/setup.sh to set up the environment." into every iteration's prompt, which meant paying for a full Claude tool-call round-trip on every loop just to re-run an idempotent script. One invocation up front does the same job.
  • tee writes the raw stream to a tmpfile while jq formats it to the terminal. After the run finishes, a second jq pulls out the final result text, and a bash pattern match checks it for the stop token.
  • The stop token is <promise>NO MORE TASKS</promise>. The <promise> tag is just a convention I picked — what matters is that it's a string Claude won't emit by accident, so the exit condition is unambiguous.
  • --permission-mode bypassPermissions is what makes this unattended at all — without it every tool call would block waiting for approval. It's only safe because the sandbox is the outer boundary.
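Pulled out of the loop, the exit condition is a plain substring match. A minimal sketch follows; STOP_TOKEN and is_done are illustrative names, not something afk.sh defines.

```shell
# The stop token from the prompt, matched as a case-sensitive substring.
STOP_TOKEN='<promise>NO MORE TASKS</promise>'

# Succeeds if the final result text contains the stop token anywhere.
is_done() {
  [[ "$1" == *"$STOP_TOKEN"* ]]
}
```

Because the match is exact, a paraphrase such as "no more tasks" in ordinary prose does not end the loop, which is the point of picking a tag-wrapped string.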

Step 6: Write the prompt

The prompt is what gives every iteration its marching orders. afk.sh concatenates three things before firing Claude:

  1. The last 5 git commits (so the agent has recent context across iterations).
  2. The open GitHub issues (gh issue list --json number,title,body,comments).
  3. The contents of ralph/prompt.md.

prompt.md itself is structured as a short playbook:

  • ISSUES / TASK SELECTION — parse the issue list, work only on issues tagged AFK (not HITL), prioritize in a fixed order (critical bugfixes → dev infra → tracer bullets → polish → refactors), and emit <promise>NO MORE TASKS</promise> when the queue is empty. afk.sh watches for that token and stops the loop.
  • EXPLORATION / IMPLEMENTATION — explore the repo, then build the task. One task per iteration.
  • FEEDBACK LOOPS — which make checks to run before committing.
  • COMMIT / THE ISSUE — commit with decisions + files + blockers; close the issue when done, otherwise leave a progress comment.
  • FEEDBACK — append session notes to ralph/feedback.md under a dated heading.

Here's the full file:

# ISSUES
 
GitHub issues are provided at start of context. Parse it to get open issues with their bodies and comments.
 
You will work on the AFK issues only, not the HITL ones.
 
You've also been passed a file containing the last few commits. Review these to understand what work has been done.
 
If all AFK tasks are complete, output <promise>NO MORE TASKS</promise>.
 
# TASK SELECTION
 
Pick the next task. Prioritize tasks in this order:
 
1. Critical bugfixes
2. Development infrastructure
 
Getting development infrastructure like tests and types and dev scripts ready is an important precursor to building features.
 
3. Tracer bullets for new features
 
Tracer bullets are small slices of functionality that go through all layers of the system, allowing you to test and validate your approach early. This helps in identifying potential issues and ensures that the overall architecture is sound before investing significant time in development.
 
TL;DR - build a tiny, end-to-end slice of the feature first, then expand it out.
 
4. Polish and quick wins
5. Refactors
 
# EXPLORATION
 
Explore the repo.
 
# IMPLEMENTATION
 
Complete the task.
 
# FEEDBACK LOOPS
 
Before committing, run only the feedback loops that apply to your change. Scope them to what you actually touched — `make pre-push` is the fallback, not the default.
 
Always invoke checks through the `make` wrappers (never call the underlying binaries directly — keeps behavior consistent with CI). Pick based on what changed:
 
- **PHP files changed** → `make pint`, `make phpstan`, `make pest`. Add `make rector` only if refactoring patterns may apply.
- **Frontend files changed (TS/TSX/JS/CSS)** → `make lint` and `make format`.
- **Pure test-addition tickets (no production code touched)** → `make pint` and `make pest`. Skip phpstan/rector/lint/format. Skip `laravel-taste-validator` too — it has nothing to validate.
- **Mixed / unsure / cross-cutting changes** → `make pre-push` (rector + pint + phpstan + pest + lint + format).
- **App-layer PHP changes (controllers, models, actions, policies, jobs, form requests)** — also run the `laravel-taste-validator` skill.
 
The goal: run only the make targets that apply, not the full `make pre-push` gate every time.
 
# COMMIT
 
Make a git commit. The commit message must:
 
1. Include key decisions made
2. Include files changed
3. Blockers or notes for next iteration
 
# FEEDBACK
Append to `ralph/feedback.md` (do NOT overwrite — previous sessions' notes stay) any feedback or improvement we need to do on your environment setup so you can work more efficiently and don't spend time working on these issues. Start your section with a heading like `## Session YYYY-MM-DD — issue #N` so entries are distinguishable. Don't commit this file.
 
# THE ISSUE
 
If the task is complete, close the original GitHub issue.
 
If the task is not complete, leave a comment on the GitHub issue with what was done.
 
# FINAL RULES
 
ONLY WORK ON A SINGLE TASK.

A few things that ended up mattering more than I expected while tuning this prompt:

  • ONLY WORK ON A SINGLE TASK. at the bottom. Without it, the agent would sometimes see two related issues and try to ship both in one commit, which defeated the "one task per iteration" loop and made failures hard to bisect.
  • Explicit priority order (bugfixes → infra → tracer bullets → polish → refactors). When I left this open, the agent gravitated toward refactors because they were the most "interesting" issues in the queue. Pinning the order fixed it.
  • Commit message requirements (decisions, files, blockers). The blockers line is what lets the next iteration — which only sees the last 5 commits — pick up where this one left off. It's context handoff disguised as a commit message.
  • FEEDBACK is append-only and not committed. If I let the agent overwrite it or commit it, I'd lose the cross-session signal that drove most of the improvements in the "what I learned the hard way" list below.

Why feedback loops are scoped, not global

The prompt originally ended every iteration with an unconditional make pre-push — rector + pint + phpstan + pest + lint + format. That's the right gate for CI, but running all six on every tracer-bullet change was slow enough that the loop spent more time linting than coding.

I rewrote that section to push the decision down to the agent. Now the prompt gives scoped guidance:

  • PHP only → make pint, make phpstan, make pest (add make rector if patterns may apply).
  • Frontend only → make lint and make format.
  • Pure test additions → make pint and make pest only.
  • Mixed / cross-cutting → fall back to make pre-push.
  • App-layer PHP (controllers, models, actions, policies, jobs, form requests) → also run the laravel-taste-validator skill.

All checks still go through make wrappers to stay consistent with CI — but the agent picks the subset that matches the diff instead of running the whole gate every time. This was the single biggest speedup to the loop.
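In the actual loop the agent makes this choice from the prompt, but the decision table is simple enough to express as a script. The sketch below is hypothetical: pick_checks, the tests/ path convention, and the extension list are assumptions, and it ignores the rector and laravel-taste-validator cases.

```shell
# Hypothetical mapping from changed file paths (one per line on stdin)
# to the scoped make targets described above.
pick_checks() {
  local php=0 front=0 tests_only=1 file
  while IFS= read -r file; do
    [ -z "$file" ] && continue
    case "$file" in
      tests/*.php)           php=1 ;;                    # test code only
      *.php)                 php=1; tests_only=0 ;;      # production PHP
      *.ts|*.tsx|*.js|*.css) front=1; tests_only=0 ;;    # frontend
      *)                     tests_only=0 ;;             # anything else
    esac
  done
  if [ "$php" -eq 1 ] && [ "$front" -eq 1 ]; then
    echo "make pre-push"                                 # mixed change
  elif [ "$php" -eq 1 ] && [ "$tests_only" -eq 1 ]; then
    echo "make pint && make pest"
  elif [ "$php" -eq 1 ]; then
    echo "make pint && make phpstan && make pest"
  elif [ "$front" -eq 1 ]; then
    echo "make lint && make format"
  else
    echo "make pre-push"                                 # unsure: full gate
  fi
}
```

It could be fed from `git diff --name-only HEAD | pick_checks`. Anything the table does not recognize falls back to the full gate, matching the "mixed / unsure" rule in the prompt.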

The laravel-taste-validator skill

For app-layer PHP changes the prompt also routes the agent through a custom skill called laravel-taste-validator. It lives in .claude/skills/laravel-taste-validator/ and codifies the conventions I care about — naming and code style, action/controller architecture, Eloquent and migration patterns, form requests and error handling, queue jobs and testing style — split into five principle groups.

The skill works by dispatching laravel-architect subagents in parallel, one per group. Each subagent reads its rules file, reads the files under review, and returns a list of violations with file paths, line numbers, and corrected code examples. The main agent then has a concrete diff to apply — it's not "this feels off", it's "line 42 violates rule X, replace with Y".

In practice that turns convention-following into another feedback loop: the agent writes a controller, the skill flags the violations it didn't catch on the first pass, and the agent fixes them before the commit. I don't have to hand-review every PR to keep the codebase in one voice — the skill encodes the voice, and the agent auto-corrects against it.

GitHub issues as the task queue

One thing worth naming explicitly: the backlog isn't a TODO file or a Notion board — it's the repo's own GitHub issues. That choice does a lot of work for free:

  • Triage happens in one place. Adding afk or hitl labels to an issue is how I decide whether the loop is allowed to touch it.
  • State lives on the issue. If Claude can't finish a task, the prompt tells it to leave a comment describing what got done; the next iteration (or I) read that comment and continue. No ad-hoc state files.
  • Completion is closing the issue. When a task is done, Claude closes the issue from inside the sandbox (gh issue close), and that single action removes it from the next iteration's context.
  • Humans and agents share the queue. I can grab an issue, Claude can grab an issue, nothing special has to reconcile the two.
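If the afk and hitl tags are GitHub labels, the queue can also be narrowed before Claude ever sees it. The afk.sh call shown earlier does not request labels, so treat this as a possible variation rather than what the script does; list_afk_issues is a hypothetical name.

```shell
# Hypothetical variation: fetch only open issues labelled "afk", and include
# the labels field so the agent can see how each issue was triaged.
list_afk_issues() {
  gh issue list --state open --label afk \
    --json number,title,body,comments,labels
}
```

Filtering at the source trades a little flexibility for a smaller prompt, since HITL issues never enter the context at all.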

Feedback as a compounding improvement mechanism

The FEEDBACK step in the prompt — "append to ralph/feedback.md anything that slowed you down" — ended up being one of the most valuable parts of the whole setup. Every iteration the agent flagged real friction it hit: tests timing out under the wrong DB, a missing PHP extension, a phpunit.xml override silently defeating Postgres, a browser helper pointing at a removed namespace, a make pre-push target that was redundant and burned minutes.

Most of the fixes in this post came directly from that file. I read through the session notes, cross-referenced them with what I'd seen in the terminal, and made the matching changes to the Dockerfile, setup.sh, afk.sh, or the prompt itself. The loop improved week over week not because I sat down to redesign it, but because the agents kept telling me where it hurt — and the append-only log made those complaints impossible to lose between sessions.

Step 7: Hoist setup.sh out of the loop

This one started as a small optimization and ended up surfacing a chain of latent bugs that had been masked for weeks. It's worth walking through because the moral isn't "here's another tweak" — it's that a whole class of error-swallowing idioms I'd been writing on autopilot were covering up real failures.

The optimization

afk.sh used to inject "Before starting, run ralph/setup.sh to set up the environment." into every iteration's prompt. Claude would execute the script as its first Bash tool call on every loop. setup.sh is idempotent, so the work was redundant after iteration 1 — but every invocation still paid the cost of a full Claude tool-call round-trip before the iteration could do anything useful.

The fix was mechanical: move ralph/setup.sh out of the prompt, invoke it once at the top of afk.sh, before the for loop. Setup cost dropped from O(iterations) to O(1).

What made the change interesting is what it exposed. In the old shape, each iteration got a fresh chance to stumble through any transient setup failure — Claude could retry, or the next iteration would mask whatever went wrong in the last. Running setup once, from a bare shell, removed the safety net. Five latent bugs fell out in sequence.

Bug 1: pg_ctl was silently failing and the script kept going

The original line was:

/usr/lib/postgresql/17/bin/pg_ctl -D /home/agent/pgdata -l /home/agent/pgdata/logfile start 2>/dev/null || true

Two compounding error-swallowers. 2>/dev/null discards the detail, || true downgrades a fatal exit into a shrug. On the first run after hoisting, pg_ctl printed "waiting for server to start.... stopped waiting" — the postmaster launched but never became ready — and the script continued into psql and createdb calls that all failed with Connection refused. The visible error was the cascade, not the root cause.

Why did the loop ever work? Because on a sandbox that had been up for a while, Postgres was already running from a previous session, and pg_ctl start against a running server is an error that || true correctly absorbs. The old script only worked when someone else had already successfully started Postgres for it.
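To see the swallowing in isolation, here is a small sketch. start_server is a hypothetical function that fails the way a bad pg_ctl start would; it is not part of the real setup.sh.

```shell
#!/bin/sh

# Hypothetical stand-in for a command that fails, such as pg_ctl on a bad start.
start_server() {
  echo "could not create lock file" >&2
  return 1
}

# Swallowed: stderr is discarded and the exit status becomes 0.
start_server 2>/dev/null || true
swallowed_status=$?

# Surfaced: keep the message and the status, then decide what to do.
if err=$(start_server 2>&1); then
  surfaced_status=0
else
  surfaced_status=$?
fi

echo "swallowed status: $swallowed_status"
echo "surfaced status:  $surfaced_status ($err)"
```

The first form leaves the caller with no way to tell success from failure; the second keeps both the exit status and the message available for a decision.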

The fix strips the defensive shims and replaces them with a state machine:

PG=/usr/lib/postgresql/17/bin
 
if ! "$PG/pg_isready" -h 127.0.0.1 -q; then
  if [ -f /home/agent/pgdata/postmaster.pid ] && ! pgrep -F /home/agent/pgdata/postmaster.pid >/dev/null 2>&1; then
    rm -f /home/agent/pgdata/postmaster.pid
  fi
  : > /home/agent/pgdata/logfile
  "$PG/pg_ctl" -D /home/agent/pgdata -l /home/agent/pgdata/logfile -w start || {
    echo "postgres failed to start; logfile:"
    cat /home/agent/pgdata/logfile
    exit 1
  }
fi

pg_isready -q is the idempotency check, so nothing relies on || true to absorb "already running" anymore. pgrep -F clears a stale PID file left behind by a dead process, so a leftover postmaster.pid from a crashed session can't confuse startup. pg_ctl -w start waits synchronously, so by the time control returns, psql calls will succeed. And the failure branch is loud: it prints the logfile and exit 1s.
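The stale-PID check is the least obvious part, so here it is on its own. This sketch swaps pgrep -F for kill -0, which is more portable but behaves the same only when the process belongs to the current user. The function name pidfile_is_stale and the demo files are hypothetical.

```shell
#!/bin/sh

# pidfile_is_stale FILE: succeed when FILE exists but names no live process.
# Uses kill -0 as a portable stand-in for the pgrep -F check in setup.sh.
# Caveat: kill -0 also fails for a live process owned by another user.
pidfile_is_stale() {
  [ -f "$1" ] || return 1
  pid=$(head -n 1 "$1")                # postmaster.pid keeps the PID on line 1
  [ -n "$pid" ] || return 0            # empty file: treat as stale
  ! kill -0 "$pid" 2>/dev/null
}

demo_dir=$(mktemp -d)

sleep 0 &                              # a process that exits immediately
dead_pid=$!
wait "$dead_pid"
echo "$dead_pid" > "$demo_dir/dead.pid"
echo "$$" > "$demo_dir/live.pid"       # this shell is certainly alive

pidfile_is_stale "$demo_dir/dead.pid" && dead_result=stale || dead_result=live
pidfile_is_stale "$demo_dir/live.pid" && live_result=stale || live_result=live

echo "dead.pid: $dead_result, live.pid: $live_result"
rm -rf "$demo_dir"
```

Only the file that points at a dead process is reported stale, so a cleanup guarded by this check never removes the PID file of a server that is actually running.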

Bug 2: /var/run/postgresql didn't exist

Once pg_ctl actually surfaced its logfile, the real failure finally showed up:

FATAL: could not create lock file "/var/run/postgresql/.s.PGSQL.5432.lock": No such file or directory

/var/run is tmpfs — its contents are wiped on every reboot. The Dockerfile creates /run/postgresql at build time with the right ownership, but that lives in the image's filesystem layer. The running container mounts a fresh tmpfs over /var/run and whatever the Dockerfile put there is gone the moment the container starts.

The previous line was defending against the wrong failure:

sudo chown agent:agent /var/run/postgresql 2>/dev/null || true

It assumed the directory existed and might have the wrong ownership. On a fresh container the directory didn't exist at all, chown failed, and the swallowers absorbed it. The fix adds one mkdir -p line ahead of the chown and drops the swallowers:

sudo mkdir -p /var/run/postgresql
sudo chown agent:agent /var/run/postgresql

Bug 3: the logfile was cumulative, so diagnostics were buried

When I started dumping the logfile on pg_ctl failure, the first attempt tailed the last 50 lines — and those lines turned out to be mostly stale noise from sessions days prior. The actual FATAL: could not create lock file line was at the very bottom, behind a wall of unrelated history. A logfile that accumulates across sessions is strictly worse than one scoped to the current session when what you want is "show me why startup just failed." Fixed by truncating before start:

: > /home/agent/pgdata/logfile

Bug 4: unsetting ANTHROPIC_API_KEY inside setup.sh did nothing useful

After the hoist, the first iteration appeared to hang after "server started" with no progress. The Claude CLI was picking up ANTHROPIC_API_KEY from the environment and routing every request through the API instead of the subscription — metering credits and, for that account, approaching its cap.

The existing guard was in setup.sh:

[ -n "${ANTHROPIC_API_KEY:-}" ] && unset ANTHROPIC_API_KEY

Correct in isolation, wrong file. setup.sh is invoked as a child process (./ralph/setup.sh, not source), so unset mutates only that child's environment; the afk.sh parent still has the key, and every claude call inherits it. The old shape appeared to work only because it had never been stress-tested: this run was the first time the account hit its API cap, and so the first time the difference between API auth and subscription auth became observable.

The fix moves the unset to afk.sh itself, before the loop. setup.sh never invokes claude, so the line doesn't belong there — putting it in both files would be two statements of intent instead of one, and would invite the same confusion the next time someone edits one without the other.
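The process-boundary behavior is easy to verify directly. In this sketch DEMO_API_KEY is a hypothetical variable standing in for ANTHROPIC_API_KEY, and sh -c plays the role of both setup.sh and the claude call.

```shell
#!/bin/sh

# DEMO_API_KEY is a hypothetical variable standing in for ANTHROPIC_API_KEY.
DEMO_API_KEY="sk-demo"
export DEMO_API_KEY

# A child process gets a copy of the environment; its unset dies with it.
sh -c 'unset DEMO_API_KEY'
after_child_unset="${DEMO_API_KEY:-<unset>}"

# Anything launched from this shell still inherits the key at this point.
seen_by_child=$(sh -c 'echo "${DEMO_API_KEY:-<unset>}"')

# Unsetting in the shell that launches the command is what takes effect.
unset DEMO_API_KEY
seen_after_unset=$(sh -c 'echo "${DEMO_API_KEY:-<unset>}"')

echo "parent after child unset:  $after_child_unset"
echo "child before parent unset: $seen_by_child"
echo "child after parent unset:  $seen_after_unset"
```

The first two values still show the key and only the third shows it gone, which is why the unset has to sit in the script that launches the command.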

Bug 5: make install-pnpm ran every time, even when it didn't need to

On my laptop pnpm install runs on the host (macOS), so node_modules ends up with Darwin-arm64 native binaries: esbuild, playwright, swc, sharp. Inside the Linux sandbox those binaries are built for the wrong platform and won't load. The initial, conservative fix was "always reinstall at setup time", which is correct, but a full pnpm install burns 30–60 seconds even when the sandbox already has a valid Linux dependency tree from a previous session.

I want the install to run on the first sandbox entry after a host-side install (or after switching branches with incompatible lockfiles) and skip otherwise. The trick is choosing a check that distinguishes "node_modules works here" from "node_modules is wrong-arch" without parsing pnpm's lock state. Runtime sentinel:

if ! pnpm exec vite --version >/dev/null 2>&1; then
  make install-pnpm
fi

vite is a direct dep, and vite pulls in esbuild, which ships a platform-specific native binary. Running vite --version exercises the esbuild require chain. If node_modules is missing, partially installed, or has Darwin binaries on Linux, this fails almost immediately. If it succeeds, the install is demonstrably usable on this platform, and we skip. Costs ~100–300 ms when the install is valid, saves 30–60 seconds when it isn't.
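The same probe-then-install pattern generalizes to other dependency trees. A minimal sketch, assuming a hypothetical helper named install_if_broken; the true and false commands stand in for a healthy and a broken node_modules so the example runs without pnpm.

```shell
#!/bin/sh

# install_if_broken PROBE INSTALL: run INSTALL only when PROBE fails.
# Both arguments are command strings; the helper name is hypothetical.
install_if_broken() {
  if sh -c "$1" >/dev/null 2>&1; then
    echo "skipped"
  else
    # Installer output goes to stderr so stdout carries only the verdict.
    sh -c "$2" >&2 && echo "installed"
  fi
}

# Stand-ins: `true` plays a healthy node_modules, `false` a broken one.
healthy=$(install_if_broken "true" "echo reinstalling")
broken=$(install_if_broken "false" "echo reinstalling")

echo "healthy tree: $healthy, broken tree: $broken"
```

In setup.sh the equivalent call would be install_if_broken "pnpm exec vite --version" "make install-pnpm". Suppressing the probe's output is acceptable here because a failing probe is the state being handled, not an error being hidden.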

The general lessons

Three themes ran through all five bugs:

  1. 2>/dev/null and || true are debt, not defenses. Every error-swallower was hiding a real failure. The right idiom for "it's ok if this fails" is to name the specific failure mode and handle it explicitly (pg_isready check, stale-PID cleanup, mkdir -p), not to blanket-suppress.
  2. Loops hide bugs that single-shot scripts don't. A setup script failing loudly is better than a loop that retries into a working state. Retry masked four different root causes here; straight-line execution surfaced them all in a single session.
  3. Environment mutations only affect their own process. unset in a child script cannot mutate the parent. If the claude call needs a clean env, the unset belongs in the file that invokes claude.

Useful commands

| Command | Purpose |
| --- | --- |
| sbx run shell --template guetteman/claude-php-8.4 | Create the sandbox the first time (pulls the template) |
| sbx run shell | Re-enter the existing sandbox (all subsequent sessions) |
| sbx ls | List running sandboxes |
| sbx rm <id> | Remove a sandbox |
| docker build -t guetteman/claude-php-8.4 ralph/ && docker push guetteman/claude-php-8.4 | Rebuild and publish the template |
| ./ralph/afk.sh 20 (inside sandbox) | Run the autonomous loop for 20 iterations |

Limitations worth knowing before you start

  • Cannot execute host binaries (Herd's PHP, herd CLI, etc.).
  • No raw TCP/UDP/ICMP — only HTTP/HTTPS passes through the network proxy. Host databases and Herd's *.test domains are unreachable.
  • sbx doesn't accept --env, --network, or arbitrary mount paths.
  • sbx only pulls from registries — local-only images aren't usable.
  • sbx run ... "<prompt>" doesn't echo the prompt back to the terminal. Use sbx run shell and drive the loop from inside.
  • PHP 8.5 packages from the ondrej PPA have dependency conflicts with the base image (Ubuntu 25.10); PHP 8.4 from default repos works cleanly.
  • Docker Desktop's docker sandbox caps memory at 4 GB, which is not enough for Claude + Postgres + ParaTest + Playwright.

What I learned the hard way

A short change log of the fixes this setup has accumulated — each one came from something that actually broke:

  • Switched from docker sandbox to sbx to escape the 4 GB memory cap.
  • Published the image to Docker Hub (guetteman/claude-php-8.4) because sbx only pulls from registries.
  • Moved the afk loop inside the sandbox shell (sbx run shell ... then ./ralph/afk.sh) because sbx's non-interactive mode doesn't show the prompt.
  • Rewrote afk.sh's jq filter to consume stream_event partial messages and print tool-call summaries ([Bash] ..., [Read] ...) instead of raw JSON.
  • Added setup.sh to replace the inline bootstrap commands that used to be arguments to docker sandbox run. It starts Postgres, fixes /run/postgresql ownership, reencodes template1 to UTF8 for ParaTest, and creates the app databases.
  • Replaced unconditional make pre-push with scoped feedback loops in ralph/prompt.md. The agent now picks the subset of make targets that match the diff instead of running the full CI gate every iteration — the single biggest speedup to the loop.
  • Added first-run bootstrap: install the Claude CLI + claude login, gh auth login, and unset ANTHROPIC_API_KEY so the CLI uses the Claude subscription instead of API credits.
  • Added PCOV to the Dockerfile for Pest coverage.
  • Added pnpm and Playwright + Chromium deps to the Dockerfile for Pest 4 browser tests.
  • Fixed phpunit.xml to stop overriding .env.testing with SQLite :memory:, which was defeating ParaTest.
  • Hoisted setup.sh out of the iteration loop so it runs once per session instead of once per iteration. See Step 7 for the five latent bugs this exposed.
  • Moved unset ANTHROPIC_API_KEY from setup.sh to afk.sh — the unset in a child process doesn't mutate the parent's environment; it has to live in the file that invokes claude.
  • Replaced blanket error-swallowing (2>/dev/null || true) in setup.sh with explicit state checks: pg_isready for idempotency, stale postmaster.pid cleanup via pgrep -F, pg_ctl -w for synchronous readiness, logfile truncation before start, and a loud failure branch that cats the logfile and exit 1s.
  • Added sudo mkdir -p /var/run/postgresql before the chown, because /var/run is tmpfs and the directory the Dockerfile created doesn't survive a container restart — the actual root cause hiding behind weeks of intermittent "Connection refused" errors.
  • Skipped make install-pnpm when pnpm exec vite --version succeeds, using vite's esbuild native-binary dependency as a cheap sentinel for "are these node_modules valid on this platform." Saves 30–60 seconds per sandbox entry when the Linux install from a previous session is intact.

References