Shipping Code AFK: A Docker Sandbox and Ralph Loop for Laravel and Claude Code
A step-by-step guide to running Claude Code unattended on a Laravel project — building the sandbox image, wiring up a self-driving loop, and learning what breaks the hard way.
If you have a backlog you'll never finish, stop working tickets by hand. Most of us use coding agents the wrong way. We sit there. We approve tool calls. We babysit a 20-minute task and call that leverage. It isn't. It's a slightly faster version of doing the work yourself, with the agent as a very expensive pair of hands that can't be trusted alone in the room.
The shift I'm after is different. Claude picks an issue off the backlog while I'm doing something else. Builds it. Runs the tests. Commits. Closes the issue. Moves to the next one. I come back to a queue that's shorter than I left it.
That shift has a name now. "Ralph loop" was coined by Geoffrey Huntley in Ralph Wiggum as a 'software engineer' (July 2025). In its purest form it's literally while :; do cat PROMPT.md | claude-code ; done — a dumb, persistent while true that pipes the same prompt into a coding agent until the work is done. The joke is that Ralph Wiggum is the agent: lovable, not very bright, relentlessly persistent, and every time he falls off the slide you add another sign to the prompt ("SLIDE DOWN, DON'T JUMP"). It's funny. It's also the cheapest known path from "AI as autocomplete" to "AI as a coworker with a ticket queue."
Two pieces make it safe. A Docker sandbox, so Claude can't touch your host, your database, or anything on your LAN. And a Ralph loop, so it keeps going until the queue is empty. Miss either piece and you get the thing everyone's afraid of: an agent with rm -rf energy and no walls.
This post walks through the whole setup on a Laravel project. Herd on the host, Postgres inside the sandbox, Pest 4 with ParaTest, Playwright for browser tests. Most of the shape of the workflow comes from Matt Pocock's Claude Code for Real Engineers cohort — the Ralph loop pattern, sandboxed autonomous agents, and the feedback-loop mindset are all from there. Huge thanks to Matt. What this post adds is the Docker + Laravel + Herd + Postgres + Pest 4 substrate, and the specific things that broke along the way.
Why not use Sandcastle?
A fair question. Matt also maintains Sandcastle, a TypeScript library that productizes exactly this pattern — sandbox provider, iteration loop, completion signal, branch strategy, the lot. It's a great piece of software and if you want the Ralph loop as an API call, it's the obvious choice.
I didn't use it for two reasons:
- I'm cautious about third-party tooling on top of my Claude subscription. Sandcastle invokes the official claude CLI inside a container using my own auth, so there's no obvious ToS issue with the library itself. But unattended long-running loops are already an unusual usage pattern, and I'd rather not add an extra layer that could look, from Anthropic's side, like something other than me driving the CLI. The risk is probably small. It's not zero, and I'd rather not find out.
- I want full control. Every piece of this setup (the Dockerfile, setup.sh, the jq filter in afk.sh, the prompt) changed multiple times as I hit real problems. Owning those files outright meant the fix was always "edit this script"; with an orchestration layer in the middle, some fixes would've meant "patch the library or route around it." For a personal workflow I run every day, the DIY version is genuinely less friction.
Neither reason is an argument against Sandcastle. Just an explanation of why this post describes a handful of shell scripts instead of an import.
The shape of unattended coding
Before any Dockerfile, there's a mental model worth getting straight. Skip it and the tactical steps read like a cargo cult.
Three ideas do all the work.
The sandbox is what unlocks the agent. Coding agents feel like a ceremony because of --permission-mode bypassPermissions: nobody wants to grant it on their host, and they're right not to, so every tool call waits for approval. But put the agent inside a container that can't reach your host, your LAN, or anything past an HTTP proxy, and bypassing permissions stops being reckless. It becomes the only sane default. The walls are what let the agent move fast.
That's the missing piece. Not the loop. Not the prompt. The boundary.
The backlog is the interface. A Ralph loop needs somewhere to pull tasks from. You could use a TODO file. Don't. Use your repo's GitHub issues, tag them afk or hitl, and let the agent pull from the same queue you do. Completion is closing the issue. Handoff is a comment. Humans and agents share the board, no reconciliation, no special state.
Feedback compounds. Every iteration, the agent writes to a feedback.md file with what was slow, what was confusing, what broke. You read it the next morning and patch the Dockerfile, the prompt, or the setup script. The loop gets better week over week not because you sit down to redesign it, but because the agent keeps telling you where it hurts.
Hold those three ideas in your head. Everything below is plumbing in service of them.
The workflow at a glance
- I triage backlog into GitHub issues and tag them either AFK (safe for the loop to pick up) or HITL (human in the loop — design calls, risky refactors, anything I want to drive myself).
- I open a sandbox with sbx run shell and run ./ralph/afk.sh N inside it.
- Each iteration, afk.sh feeds Claude the last 5 commits + open GitHub issues + ralph/prompt.md and lets it pick one AFK issue, build it end-to-end, run the scoped feedback loops, commit, close the issue, and log friction to ralph/feedback.md.
- When there are no AFK issues left, Claude emits <promise>NO MORE TASKS</promise> and the loop exits.
Claude runs with --permission-mode bypassPermissions inside the sandbox. No tool-call prompts. No babysitting. The network proxy is the only thing standing between the agent and the outside world — and that's enough.
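The control flow of that workflow fits in a few lines. This is a minimal sketch, not the real script: run_agent is a hypothetical stand-in for the claude invocation, and the two-issue backlog is made up so the exit logic can be tried without a sandbox.

```shell
# Control-flow sketch only. run_agent is a hypothetical stand-in for the real
# claude invocation, so the exit logic can be tried without a sandbox.
STOP_TOKEN='<promise>NO MORE TASKS</promise>'

run_agent() {
  # Pretend the backlog holds two AFK issues. Iterations 1 and 2 do work;
  # iteration 3 finds the queue empty and emits the stop token.
  if [ "$1" -le 2 ]; then
    echo "closed issue #$1"
  else
    echo "$STOP_TOKEN"
  fi
}

ralph_loop() {
  max=$1
  i=1
  while [ "$i" -le "$max" ]; do
    result=$(run_agent "$i")
    case "$result" in
      *"$STOP_TOKEN"*)
        echo "stopped after $i iterations"
        return 0
        ;;
    esac
    i=$((i + 1))
  done
  echo "hit the cap of $max iterations"
}

ralph_loop 20
```

Run as written, it prints "stopped after 3 iterations". The iteration cap matters as much as the stop token: if the agent never emits the token, the loop still ends.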
The ralph/ file layout
| File | Purpose |
|---|---|
| Dockerfile | Builds the sandbox template (PHP, Postgres, pnpm, Playwright). |
| setup.sh | Runs once per session at the top of afk.sh: boots Postgres, fixes perms, ensures UTF8 template1, skips pnpm install when node_modules is valid. |
| afk.sh | The N-iteration loop itself (stream-json + jq filter + stop-token watch). |
| prompt.md | The playbook Claude reads every iteration. |
| feedback.md | Append-only log of friction the agent hit; not committed. |
Step 1: Pick the right sandbox runner
There are two options, and the choice matters.
I started with docker sandbox — the subcommand shipped with Docker Desktop. It works. It's also capped at 4 GB of memory, and that's not enough.
Claude, Postgres, ParaTest with 10 workers, Vite, Playwright. Something gets killed mid-iteration. Usually Postgres. The Linux out-of-memory killer shoots whichever process is the biggest offender when a container hits its ceiling, and you only find out when the tests fail for reasons that make no sense.
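One way to confirm that memory is the culprit, rather than guessing from odd test failures, is to read the container's cgroup counters. The sketch below assumes the sandbox uses cgroup v2 and exposes it at /sys/fs/cgroup, which I have not verified for either runner; the function names are illustrative.

```shell
# Assumption: the sandbox uses cgroup v2 and exposes it at /sys/fs/cgroup.
# memory.events holds "key value" pairs; oom_kill counts processes the kernel
# killed because this cgroup hit its memory limit.
oom_kill_count() {
  events_file=${1:-/sys/fs/cgroup/memory.events}
  awk '$1 == "oom_kill" { print $2; found = 1 } END { if (!found) print 0 }' "$events_file"
}

# memory.max holds the ceiling in bytes, or the literal string "max" when uncapped.
memory_limit() {
  limit_file=${1:-/sys/fs/cgroup/memory.max}
  cat "$limit_file"
}
```

Calling oom_kill_count before and after an iteration shows whether anything was killed during that run, without having to dig through kernel logs.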
Docker ships a second CLI called sbx that creates sandboxes without the Desktop memory limit. I switched. The out-of-memory kills went away. Two gotchas come with that switch:
- sbx only pulls images from a registry; it doesn't use local Docker images. Building the template locally (docker build -t claude-php-8.4 ralph/) is not enough. You have to push it to Docker Hub (or another registry that sbx can pull from) and reference it by its registry name.
- sbx run doesn't print the initial prompt back to stdout the way docker sandbox run claude . -- "..." did. Passing a long prompt as an argument works, but you don't see what Claude received, which made debugging the loop painful. The workaround (run the loop from inside an interactive shell) is in Step 5.
Prereqs from here on out: the sbx CLI and a Docker Hub account to host the template image.
Step 2: Build the custom sandbox image
The base docker/sandbox-templates:claude-code image includes Node.js, Git, GitHub CLI, ripgrep, jq, Go, and Python 3 — but no PHP or PostgreSQL. I extend it with PHP 8.4 (all Laravel Cloud-supported extensions), PostgreSQL 17, Composer, pnpm, and Playwright's browser dependencies.
The Dockerfile lives at ralph/Dockerfile:
FROM docker/sandbox-templates:claude-code
USER root
RUN apt-get update && apt-get install -y \
php8.4 \
php8.4-cli \
php8.4-common \
php8.4-apcu \
php8.4-bcmath \
php8.4-bz2 \
php8.4-curl \
php8.4-dba \
php8.4-enchant \
php8.4-excimer \
php8.4-gd \
php8.4-gmp \
php8.4-igbinary \
php8.4-imagick \
php8.4-imap \
php8.4-intl \
php8.4-ldap \
php8.4-mbstring \
php8.4-mongodb \
php8.4-msgpack \
php8.4-mysql \
php8.4-odbc \
php8.4-opcache \
php8.4-pcov \
php8.4-pgsql \
php8.4-pspell \
php8.4-readline \
php8.4-redis \
php8.4-snmp \
php8.4-soap \
php8.4-sqlite3 \
php8.4-tidy \
php8.4-xml \
php8.4-zip \
postgresql \
postgresql-client \
unzip \
libnss3 \
libnspr4 \
libatk1.0-0t64 \
libatk-bridge2.0-0t64 \
libatspi2.0-0t64 \
libcups2t64 \
libxcomposite1 \
libxdamage1 \
libxfixes3 \
libxrandr2 \
libgbm1 \
libpango-1.0-0 \
libcairo2 \
libasound2t64 \
libxshmfence1 \
libx11-xcb1 \
libdrm2 \
libxkbcommon0 \
fonts-liberation \
&& rm -rf /var/lib/apt/lists/*
COPY --from=composer:2.9.5 /usr/bin/composer /usr/bin/composer
RUN mkdir -p /home/agent/pgdata /run/postgresql \
&& chown -R agent:agent /home/agent/pgdata /run/postgresql
USER agent
RUN npm install -g pnpm
RUN /usr/lib/postgresql/17/bin/initdb -D /home/agent/pgdata --encoding=UTF8 --locale=C.UTF-8
RUN npm install -g playwright@latest
RUN npx playwright install
A few things worth calling out:
- php8.4-pcov so Pest can produce coverage reports inside the sandbox.
- A long list of Playwright/Chromium shared libraries (libnss3, libatk1.0-0t64, libcups2t64, libgbm1, etc.) required for Pest 4 browser tests.
- pnpm globally via npm install -g pnpm.
- playwright globally and its browsers via npx playwright install.
- A pre-initialized PostgreSQL data directory owned by agent so Postgres can start without root at runtime. I initialize it with --encoding=UTF8 --locale=C.UTF-8 because ParaTest's per-worker databases inherit template1's encoding, and a mismatched encoding blocks parallel test runs (more on this in Step 4).
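With this many packages in one apt-get line, a quick smoke test after the build can catch a silently missing extension. This is a minimal sketch: missing_extensions is a hypothetical helper, and which extensions you check is up to you.

```shell
# Smoke-test sketch for the built image. missing_extensions is a hypothetical
# helper; it prints every required extension that is absent from the list.
missing_extensions() {
  loaded=$1   # newline-separated module list, as printed by `php -m`
  shift
  for ext in "$@"; do
    # -i: case-insensitive, -x: whole-line match, -F: treat the name literally
    printf '%s\n' "$loaded" | grep -qixF -- "$ext" || echo "$ext"
  done
}
```

Inside the sandbox shell, missing_extensions "$(php -m)" pcov pgsql intl redis should print nothing when all four are loaded.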
Build it and push it:
docker build -t guetteman/claude-php-8.4 ralph/
docker push guetteman/claude-php-8.4
The push is what makes the template visible to sbx. Skip it and sbx run --template guetteman/claude-php-8.4 ... will fail to pull. Now the image exists, but a running container is not a running environment. Postgres still needs to boot, permissions still need fixing, and a stray ANTHROPIC_API_KEY still needs to get unset before it costs you real money.
Step 3: Decide where Postgres runs (spoiler: inside the sandbox)
Your first instinct will be to point the sandbox at Herd's Postgres on the host. It won't work.
Docker sandboxes, both docker sandbox and sbx, only allow HTTP/HTTPS outbound traffic. Raw TCP is blocked at the network layer. That means Postgres (5432), MySQL (3306), Redis, anything on the LAN is unreachable. A regular docker run can punch through with --network host or --add-host. Sandboxes strip those flags out on purpose. The walls are the feature.
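To see the boundary for yourself, a raw TCP probe from inside the sandbox is enough. The sketch below leans on bash's /dev/tcp pseudo-device; tcp_probe is a hypothetical helper, and the host and port you pass are placeholders for whatever you expect to be blocked.

```shell
# Probe sketch: bash's /dev/tcp pseudo-device attempts a raw TCP connect.
# tcp_probe is a hypothetical helper; host and port are placeholders.
tcp_probe() {
  host=$1
  port=$2
  # timeout guards against connects that hang instead of being refused
  if timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
    echo "reachable"
  else
    echo "unreachable"
  fi
}
```

Going by the restriction described above, pointing it at the host's Postgres port from inside a sandbox should report unreachable, while the same call from a regular terminal on the host reports reachable.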
So Postgres moves inside. Your .env gets these credentials.
DB_CONNECTION=pgsql
DB_HOST=127.0.0.1
DB_PORT=5432
DB_DATABASE=resumeskit
DB_USERNAME=root
DB_PASSWORD=
No password is needed: a data directory created by initdb without an --auth option uses trust authentication for local connections. Running Postgres inside the sandbox also means every new session has to boot it from scratch, fix its ownership, and make sure template1 is UTF8 before ParaTest clones from it. That's what setup.sh is for.
Step 4: Write setup.sh — what the sandbox needs before Claude starts
ralph/setup.sh runs once at the top of afk.sh, before the iteration loop begins. It handles everything that can't live in the image because it depends on mounted project state, on the sandbox's runtime user IDs, or on tmpfs state that gets wiped on every container boot:
#!/bin/bash
set -eo pipefail
if ! pnpm exec vite --version >/dev/null 2>&1; then
make install-pnpm
fi
# /var/run is tmpfs; the dir the Dockerfile created is gone on every boot.
sudo mkdir -p /var/run/postgresql
sudo chown agent:agent /var/run/postgresql
PG=/usr/lib/postgresql/17/bin
if ! "$PG/pg_isready" -h 127.0.0.1 -q; then
if [ -f /home/agent/pgdata/postmaster.pid ] && ! pgrep -F /home/agent/pgdata/postmaster.pid >/dev/null 2>&1; then
rm -f /home/agent/pgdata/postmaster.pid
fi
: > /home/agent/pgdata/logfile
"$PG/pg_ctl" -D /home/agent/pgdata -l /home/agent/pgdata/logfile -w start || {
echo "postgres failed to start; logfile:"
cat /home/agent/pgdata/logfile
exit 1
}
fi
"$PG/psql" -h 127.0.0.1 -d postgres -c 'CREATE ROLE root WITH LOGIN SUPERUSER;' 2>/dev/null || true
if ! "$PG/psql" -h 127.0.0.1 -d postgres -tAc \
"SELECT pg_encoding_to_char(encoding) FROM pg_database WHERE datname='template1'" \
| grep -q '^UTF8$'; then
"$PG/psql" -h 127.0.0.1 -d postgres <<'SQL'
UPDATE pg_database SET datistemplate=FALSE WHERE datname='template1';
DROP DATABASE template1;
CREATE DATABASE template1 WITH TEMPLATE = template0 ENCODING = 'UTF8';
UPDATE pg_database SET datistemplate=TRUE WHERE datname='template1';
SQL
fi
"$PG/createdb" -h 127.0.0.1 resumeskit 2>/dev/null || true
"$PG/createdb" -h 127.0.0.1 test 2>/dev/null || true
All steps are idempotent: re-running setup.sh on an already-booted sandbox is a no-op. The early draft of this script was shorter and more defensive (lots of 2>/dev/null || true), and that turned out to be a mistake. The current shape (explicit state checks, loud failures) only exists because I hoisted setup out of the loop and watched five latent bugs fall out in sequence. That's Step 7.
ParaTest and template1 encoding
When I enabled parallel tests I hit this from 9 of 10 ParaTest workers:
new encoding (UTF8) is incompatible with the encoding of the
template database (SQL_ASCII).
ParaTest creates per-worker databases by cloning template1. If template1 isn't UTF8, every CREATE DATABASE ... ENCODING 'UTF8' call fails. I fixed this in two places:
- Dockerfile: initdb --encoding=UTF8 --locale=C.UTF-8 so new images ship UTF8 from the start.
- setup.sh re-encodes template1 idempotently for sandboxes built against an older image. The script drops and recreates template1 from template0 with ENCODING = 'UTF8' only if it isn't already UTF8.
A related fix at the Laravel layer: phpunit.xml used to override .env.testing with SQLite :memory:, which silently defeated the Postgres setup. I removed the override so .env.testing's PostgreSQL credentials actually apply during tests.
Unset ANTHROPIC_API_KEY in afk.sh, not setup.sh
afk.sh clears the API key once, before the loop:
[ -n "${ANTHROPIC_API_KEY:-}" ] && unset ANTHROPIC_API_KEY
The ${ANTHROPIC_API_KEY:-} default keeps the test safe under set -u when the variable is already absent. Without this line, if the host's ANTHROPIC_API_KEY leaks through into the sandbox environment, the CLI silently picks it up and bills API credits instead of using your Claude subscription.
The unset has to live in afk.sh, not setup.sh. setup.sh is invoked as a child process (./ralph/setup.sh, not source), so unset inside it only mutates the subshell's environment — the parent afk.sh still has the key, and every claude invocation it spawns inherits it. Put the line in the file that invokes claude. More on this in Step 7.
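The child-process behaviour is easy to reproduce without any of the tooling. In this minimal sketch, DEMO_KEY is a made-up variable standing in for the API key.

```shell
export DEMO_KEY=secret          # hypothetical variable standing in for the API key

sh -c 'unset DEMO_KEY'          # child process: its environment dies with it
after_child=${DEMO_KEY:-UNSET}

unset DEMO_KEY                  # same shell: the variable is really gone
after_same_shell=${DEMO_KEY:-UNSET}

echo "after child: $after_child, after same-shell unset: $after_same_shell"
```

It prints "after child: secret, after same-shell unset: UNSET". The unset inside the child never reaches the shell that goes on to launch claude.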
Step 5: Run the loop from inside an interactive shell
My automation loop is ralph/afk.sh, which shells claude --verbose --print --output-format stream-json ... N times, formats the stream, and stops when Claude emits <promise>NO MORE TASKS</promise>.
Originally afk.sh wrapped the claude call with docker sandbox run claude . -- .... That broke after I moved to sbx: the CLI swallows the prompt text in its output, so the loop ran blind — I couldn't tell what Claude saw.
The fix is to not pass the workload as an argument to sbx at all. Open an interactive shell in the sandbox and run afk.sh from inside it:
# First time only — creates the sandbox from the template:
sbx run shell --template guetteman/claude-php-8.4
# Every subsequent session — reuses the existing sandbox:
sbx run shell
# Then, inside the sandbox:
./ralph/afk.sh 20
First-time bootstrap inside the shell
The template ships Claude Code and gh, but neither is authenticated. On first entry:
- Install/update the Claude CLI if it's missing, then run claude once and complete the interactive login.
- Run gh auth login and complete the GitHub device flow. The loop shells out to gh issue list every iteration, so this has to work non-interactively afterwards.
- Confirm ANTHROPIC_API_KEY is unset (afk.sh clears it once before the loop, but check once by hand).
Two consequences of running the loop from inside the shell:
- The script calls claude directly (no docker sandbox run claude . -- prefix). It sets --permission-mode bypassPermissions and --include-partial-messages so I can stream tool-call summaries as they happen instead of waiting for full messages.
- afk.sh invokes ralph/setup.sh once, before the for loop. The prompt passed to claude no longer mentions setup. Pushing that bootstrap into Claude's first tool call on every iteration was burning a full round-trip (parse → schedule → stream → complete) per loop for a script that's idempotent anyway.
The stream is filtered by a jq program that turns each content_block_start / content_block_delta / content_block_stop event into a readable line — so instead of raw JSON you see [Bash] composer install, [Read] app/Models/User.php, etc. That filter lives inline in afk.sh.
Here's the full script:
#!/bin/bash
set -eo pipefail
if [ -z "$1" ]; then
echo "Usage: $0 <iterations>"
exit 1
fi
# jq filter to extract final result
final_result='select(.type == "result").result // empty'
# jq filter: stream text and format tool calls with readable summaries
stream_text='foreach inputs as $obj (
{tool: "", input: "", emit: null};
if $obj.type == "stream_event" then
if $obj.event.type == "content_block_delta" then
if $obj.event.delta.type == "text_delta" then
.emit = $obj.event.delta.text
elif $obj.event.delta.type == "input_json_delta" then
.input += ($obj.event.delta.partial_json // "")
| .emit = null
else .emit = null end
elif $obj.event.type == "content_block_start"
and $obj.event.content_block.type == "tool_use" then
.tool = $obj.event.content_block.name
| .input = ""
| .emit = null
elif $obj.event.type == "content_block_stop" then
if .tool != "" then
(.input | try fromjson catch {}) as $inp |
.emit = "\n[\(.tool)] \(
if .tool == "Bash" then ($inp.command // "")
elif .tool == "Read" then ($inp.file_path // "")
elif .tool == "Edit" then ($inp.file_path // "")
elif .tool == "Write" then ($inp.file_path // "")
elif .tool == "Grep" then ([$inp.pattern, $inp.glob, $inp.path] | map(select(. != null)) | join(" "))
elif .tool == "Glob" then ($inp.pattern // "")
elif .tool == "Agent" then ($inp.description // "")
else ($inp | to_entries | map(.key + "=" + (.value | tostring)) | join(" "))
end
)\n"
else .emit = null end
| .tool = ""
| .input = ""
elif $obj.event.type == "message_stop" then
.emit = "\n\n"
else .emit = null end
else .emit = null end;
.emit // empty
)'
[ -n "${ANTHROPIC_API_KEY:-}" ] && unset ANTHROPIC_API_KEY
ralph/setup.sh
for ((i=1; i<=$1; i++)); do
tmpfile=$(mktemp)
trap "rm -f $tmpfile" EXIT
commits=$(git log -n 5 --format="%H%n%ad%n%B---" --date=short 2>/dev/null || echo "No commits found")
issues=$(gh issue list --state open --json number,title,body,comments)
prompt=$(cat ralph/prompt.md)
echo ""
echo "========================================="
echo " Loop $i / $1"
echo "========================================="
echo ""
claude --verbose --print --permission-mode bypassPermissions --output-format stream-json --include-partial-messages \
"Previous commits: $commits $issues $prompt" \
| grep --line-buffered '^{' \
| tee "$tmpfile" \
| jq -n --unbuffered -rj "$stream_text"
result=$(jq -r "$final_result" "$tmpfile")
rm -f "$tmpfile" # the EXIT trap is reset every iteration, so it only covers the last tmpfile
if [[ "$result" == *"<promise>NO MORE TASKS</promise>"* ]]; then
echo "Ralph complete after $i iterations."
exit 0
fi
done
A few details worth pointing out:
- setup.sh runs once, before the loop. Earlier versions injected "Before starting, run ralph/setup.sh to set up the environment." into every iteration's prompt, which meant paying for a full Claude tool-call round-trip on every loop just to re-run an idempotent script. One invocation up front does the same job.
- tee writes the raw stream to a tmpfile while jq formats it to the terminal. After the run finishes, a second jq pulls out the final result text and the script checks it for the stop token.
- The stop token is <promise>NO MORE TASKS</promise>. The <promise> tag is just a convention I picked. What matters is that it's a string Claude won't emit by accident, so the exit condition is unambiguous.
- --permission-mode bypassPermissions is what makes this unattended at all. Without it every tool call would block waiting for approval. It's only safe because the sandbox is the outer boundary.
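The tee split is the part that's easiest to get wrong, so here it is in isolation. This is a sketch with stand-ins: fake_stream replaces the claude output and a sed prefix replaces the jq formatter, both invented for the example.

```shell
STOP_TOKEN='<promise>NO MORE TASKS</promise>'

# Hypothetical stand-in for the agent's output stream.
fake_stream() {
  printf '%s\n' 'picked issue #12' "nothing left. $STOP_TOKEN"
}

capture=$(mktemp)
# tee saves the raw stream; the sed stage stands in for the jq formatter.
fake_stream | tee "$capture" | sed 's/^/[agent] /'

# Inspect the saved raw copy, not the formatted terminal output.
if grep -qF -- "$STOP_TOKEN" "$capture"; then
  decision=stop
else
  decision=continue
fi
rm -f "$capture"
echo "decision: $decision"
```

Checking the saved copy means the formatter can be as lossy as it likes without ever hiding the stop token from the loop.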
Step 6: Write the prompt
The prompt is what gives every iteration its marching orders. afk.sh concatenates three things before firing Claude:
- The last 5 git commits (so the agent has recent context across iterations).
- The open GitHub issues (gh issue list --json number,title,body,comments).
- The contents of ralph/prompt.md.
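The script joins those three inputs with plain spaces. If you'd rather make the boundaries explicit, a variation like the following works; build_prompt and its section titles are illustrative and not part of the original script.

```shell
# Variation on the script's concatenation: the same three inputs, but with
# labelled sections. build_prompt and the section titles are illustrative.
build_prompt() {
  commits=$1
  issues=$2
  playbook=$3
  printf '## Previous commits\n%s\n\n## Open issues (JSON)\n%s\n\n%s\n' \
    "$commits" "$issues" "$playbook"
}
```

In afk.sh this would be called as build_prompt "$commits" "$issues" "$prompt" and the result passed to claude as the single prompt argument.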
prompt.md itself is structured as a short playbook:
- ISSUES / TASK SELECTION: parse the issue list, work only on issues tagged AFK (not HITL), prioritize in a fixed order (critical bugfixes → dev infra → tracer bullets → polish → refactors), and emit <promise>NO MORE TASKS</promise> when the queue is empty. afk.sh watches for that token and stops the loop.
- EXPLORATION / IMPLEMENTATION: explore the repo, then build the task. One task per iteration.
- FEEDBACK LOOPS: which make checks to run before committing.
- COMMIT / THE ISSUE: commit with decisions + files + blockers; close the issue when done, otherwise leave a progress comment.
- FEEDBACK: append session notes to ralph/feedback.md under a dated heading.
Here's the full file:
# ISSUES
GitHub issues are provided at start of context. Parse it to get open issues with their bodies and comments.
You will work on the AFK issues only, not the HITL ones.
You've also been passed a file containing the last few commits. Review these to understand what work has been done.
If all AFK tasks are complete, output <promise>NO MORE TASKS</promise>.
# TASK SELECTION
Pick the next task. Prioritize tasks in this order:
1. Critical bugfixes
2. Development infrastructure
Getting development infrastructure like tests and types and dev scripts ready is an important precursor to building features.
3. Tracer bullets for new features
Tracer bullets are small slices of functionality that go through all layers of the system, allowing you to test and validate your approach early. This helps in identifying potential issues and ensures that the overall architecture is sound before investing significant time in development.
TL;DR - build a tiny, end-to-end slice of the feature first, then expand it out.
4. Polish and quick wins
5. Refactors
# EXPLORATION
Explore the repo.
# IMPLEMENTATION
Complete the task.
# FEEDBACK LOOPS
Before committing, run only the feedback loops that apply to your change. Scope them to what you actually touched — `make pre-push` is the fallback, not the default.
Always invoke checks through the `make` wrappers (never call the underlying binaries directly — keeps behavior consistent with CI). Pick based on what changed:
- **PHP files changed** — `make pint`, `make phpstan`, `make pest`. Add `make rector` only if refactoring patterns may apply.
- **Frontend files changed (TS/TSX/JS/CSS)** — `make lint` and `make format`.
- **Pure test-addition tickets (no production code touched)** — `make pint` and `make pest`. Skip phpstan/rector/lint/format. Skip `laravel-taste-validator` too — it has nothing to validate.
- **Mixed / unsure / cross-cutting changes** — `make pre-push` (rector + pint + phpstan + pest + lint + format).
- **App-layer PHP changes (controllers, models, actions, policies, jobs, form requests)** — also run the `laravel-taste-validator` skill.
The goal: run only the make targets that apply, not the full `make pre-push` gate every time.
# COMMIT
Make a git commit. The commit message must:
1. Include key decisions made
2. Include files changed
3. Blockers or notes for next iteration
# FEEDBACK
Append to `ralph/feedback.md` (do NOT overwrite — previous sessions' notes stay) any feedback or improvement we need to do on your environment setup so you can work more efficiently and don't spend time working on these issues. Start your section with a heading like `## Session YYYY-MM-DD — issue #N` so entries are distinguishable. Don't commit this file.
# THE ISSUE
If the task is complete, close the original GitHub issue.
If the task is not complete, leave a comment on the GitHub issue with what was done.
# FINAL RULES
ONLY WORK ON A SINGLE TASK.
A few things that ended up mattering more than I expected while tuning this prompt:
- ONLY WORK ON A SINGLE TASK. at the bottom. Without it, the agent would sometimes see two related issues and try to ship both in one commit, which defeated the "one task per iteration" loop and made failures hard to bisect.
- Explicit priority order (bugfixes → infra → tracer bullets → polish → refactors). When I left this open, the agent gravitated toward refactors because they were the most "interesting" issues in the queue. Pinning the order fixed it.
- Commit message requirements (decisions, files, blockers). The blockers line is what lets the next iteration, which only sees the last 5 commits, pick up where this one left off. It's context handoff disguised as a commit message.
- FEEDBACK is append-only and not committed. If I let the agent overwrite it or commit it, I'd lose the cross-session signal that drove most of the improvements in the "what I learned the hard way" list below.
Why feedback loops are scoped, not global
The prompt originally ended every iteration with an unconditional make pre-push — rector + pint + phpstan + pest + lint + format. That's the right gate for CI, but running all six on every tracer-bullet change was slow enough that the loop spent more time linting than coding.
I rewrote that section to push the decision down to the agent. Now the prompt gives scoped guidance:
- PHP only → make pint, make phpstan, make pest (add make rector if patterns may apply).
- Frontend only → make lint and make format.
- Pure test additions → make pint and make pest only.
- Mixed / cross-cutting → fall back to make pre-push.
- App-layer PHP (controllers, models, actions, policies, jobs, form requests) → also run the laravel-taste-validator skill.
All checks still go through make wrappers to stay consistent with CI — but the agent picks the subset that matches the diff instead of running the whole gate every time. This was the single biggest speedup to the loop.
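In the setup described here the agent makes that choice itself from the prompt. To make the decision table concrete, the same rules can be written as a deterministic helper. This is a rough sketch: pick_checks is hypothetical, the path patterns (a tests/ prefix, the listed extensions) are assumptions about the project layout, and the rector and taste-validator cases are left out.

```shell
# Hypothetical helper mirroring the scoped rules. Path patterns are assumptions
# about the project layout; rector and the taste validator are left out.
pick_checks() {
  php=0 prod_php=0 frontend=0
  for f in "$@"; do
    case "$f" in
      tests/*.php) php=1 ;;                  # test-only PHP
      *.php) php=1; prod_php=1 ;;            # production PHP
      *.ts|*.tsx|*.js|*.css) frontend=1 ;;
    esac
  done
  if [ "$php" -eq 1 ] && [ "$frontend" -eq 1 ]; then
    echo "make pre-push"                     # mixed: run the full gate
  elif [ "$prod_php" -eq 1 ]; then
    echo "make pint phpstan pest"
  elif [ "$php" -eq 1 ]; then
    echo "make pint pest"                    # pure test additions
  elif [ "$frontend" -eq 1 ]; then
    echo "make lint format"
  else
    echo "make pre-push"                     # unsure: fall back to the full gate
  fi
}
```

For example, pick_checks tests/Feature/LoginTest.php prints "make pint pest", while passing a PHP file and a TSX file together prints "make pre-push".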
The laravel-taste-validator skill
For app-layer PHP changes the prompt also routes the agent through a custom skill called laravel-taste-validator. It lives in .claude/skills/laravel-taste-validator/ and codifies the conventions I care about — naming and code style, action/controller architecture, Eloquent and migration patterns, form requests and error handling, queue jobs and testing style — split into five principle groups.
The skill works by dispatching laravel-architect subagents in parallel, one per group. Each subagent reads its rules file, reads the files under review, and returns a list of violations with file paths, line numbers, and corrected code examples. The main agent then has a concrete diff to apply — it's not "this feels off", it's "line 42 violates rule X, replace with Y".
In practice that turns convention-following into another feedback loop: the agent writes a controller, the skill flags the violations it didn't catch on the first pass, and the agent fixes them before the commit. I don't have to hand-review every PR to keep the codebase in one voice — the skill encodes the voice, and the agent auto-corrects against it.
GitHub issues as the task queue
One thing worth naming explicitly: the backlog isn't a TODO file or a Notion board — it's the repo's own GitHub issues. That choice does a lot of work for free:
- Triage happens in one place. Adding afk or hitl labels to an issue is how I decide whether the loop is allowed to touch it.
- State lives on the issue. If Claude can't finish a task, the prompt tells it to leave a comment describing what got done; the next iteration (or I) reads that comment and continues. No ad-hoc state files.
- Completion is closing the issue. When a task is done, Claude closes the issue from inside the sandbox (gh issue close), and that single action removes it from the next iteration's context.
- Humans and agents share the queue. I can grab an issue, Claude can grab an issue, nothing special has to reconcile the two.
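Those queue operations map onto three gh commands. The wrappers below are a sketch: the helper names are made up, the label is assumed to be literally named afk, and filtering by label on the CLI side is a variation (the script in this post passes every open issue and lets the prompt do the filtering). DRY_RUN prints the command instead of running it.

```shell
# run executes a command, or just prints it when DRY_RUN is set, which makes
# the wrappers safe to try outside a repo. The helper names are illustrative.
run() {
  if [ -n "${DRY_RUN:-}" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

list_afk_issues() {   # assumes the label is literally named "afk"
  run gh issue list --state open --label afk --json number,title,body,comments
}

finish_issue() {      # task done: close with a note
  run gh issue close "$1" --comment "$2"
}

hand_off_issue() {    # task unfinished: leave the state on the issue
  run gh issue comment "$1" --body "$2"
}
```

Running DRY_RUN=1 finish_issue 42 done prints the gh issue close command without touching GitHub, which is handy while tuning the prompt.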
Feedback as a compounding improvement mechanism
The FEEDBACK step in the prompt — "append to ralph/feedback.md anything that slowed you down" — ended up being one of the most valuable parts of the whole setup. Every iteration the agent flagged real friction it hit: tests timing out under the wrong DB, a missing PHP extension, a phpunit.xml override silently defeating Postgres, a browser helper pointing at a removed namespace, a make pre-push target that was redundant and burned minutes.
Most of the fixes in this post came directly from that file. I read through the session notes, cross-referenced them with what I'd seen in the terminal, and made the matching changes to the Dockerfile, setup.sh, afk.sh, or the prompt itself. The loop improved week over week not because I sat down to redesign it, but because the agents kept telling me where it hurt — and the append-only log made those complaints impossible to lose between sessions.
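The append-only property comes down to one redirection operator. In the real loop the agent writes the file itself; this sketch only shows the shape of an entry. log_feedback and its default path are illustrative, and the heading is a plain-hyphen adaptation of the format the prompt asks for.

```shell
# Append-only sketch. log_feedback and the default path are illustrative; the
# heading adapts the "Session <date>, issue #N" format from the prompt.
log_feedback() {
  issue=$1
  note=$2
  file=${3:-ralph/feedback.md}
  {
    printf '## Session %s - issue #%s\n' "$(date +%F)" "$issue"
    printf '%s\n\n' "$note"
  } >> "$file"   # >> appends; a single > would wipe earlier sessions
}
```

Two calls leave two dated sections in the file, in order, which is what keeps a complaint from one session readable in the next.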
Step 7: Hoist setup.sh out of the loop
This one started as a small optimization and ended up surfacing a chain of latent bugs that had been masked for weeks. It's worth walking through because the moral isn't "here's another tweak" — it's that a whole class of error-swallowing idioms I'd been writing on autopilot were covering up real failures.
The optimization
afk.sh used to inject "Before starting, run ralph/setup.sh to set up the environment." into every iteration's prompt. Claude would execute the script as its first Bash tool call on every loop. setup.sh is idempotent, so the work was redundant after iteration 1 — but every invocation still paid the cost of a full Claude tool-call round-trip before the iteration could do anything useful.
The fix was mechanical: move ralph/setup.sh out of the prompt, invoke it once at the top of afk.sh, before the for loop. Setup cost dropped from O(iterations) to O(1).
What made the change interesting is what it exposed. In the old shape, each iteration got a fresh chance to stumble through any transient setup failure — Claude could retry, or the next iteration would mask whatever went wrong in the last. Running setup once, from a bare shell, removed the safety net. Five latent bugs fell out in sequence.
Bug 1: pg_ctl was silently failing and the script kept going
The original line was:
/usr/lib/postgresql/17/bin/pg_ctl -D /home/agent/pgdata -l /home/agent/pgdata/logfile start 2>/dev/null || true
Two compounding error-swallowers. 2>/dev/null discards the detail, || true downgrades a fatal exit into a shrug. On the first run after hoisting, pg_ctl printed "waiting for server to start.... stopped waiting" (the postmaster launched but never became ready) and the script continued into psql and createdb calls that all failed with Connection refused. The visible error was the cascade, not the root cause.
Why did the loop ever work? Because on a sandbox that had been up for a while, Postgres was already running from a previous session, and pg_ctl start against a running server is an error that || true correctly absorbs. The old script only worked when someone else had already successfully started Postgres for it.
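The difference between the two styles is easy to see with a fake failing command. In this sketch, start_service is a hypothetical stand-in for pg_ctl, and the error text is only illustrative.

```shell
# start_service is a hypothetical stand-in for a command that fails at startup.
start_service() {
  echo "FATAL: could not create lock file" >&2
  return 1
}

# Swallowed: stderr is discarded and the failure is rewritten to success.
swallowed() {
  start_service 2>/dev/null || true
}

# Loud: keep the message, show it, and pass the failure up.
loud() {
  if ! err=$(start_service 2>&1); then
    echo "service failed to start: $err"
    return 1
  fi
}

swallowed
echo "swallowed exit status: $?"
loud || echo "loud exit status: $?"
```

The swallowed version reports status 0 and prints nothing, so the script carries on as if the service were up. The loud version prints the reason and returns 1, which is what lets set -e or an explicit exit stop the cascade.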
The fix strips the defensive shims and replaces them with a state machine:
PG=/usr/lib/postgresql/17/bin
if ! "$PG/pg_isready" -h 127.0.0.1 -q; then
  if [ -f /home/agent/pgdata/postmaster.pid ] && ! pgrep -F /home/agent/pgdata/postmaster.pid >/dev/null 2>&1; then
    rm -f /home/agent/pgdata/postmaster.pid
  fi
  : > /home/agent/pgdata/logfile
  "$PG/pg_ctl" -D /home/agent/pgdata -l /home/agent/pgdata/logfile -w start || {
    echo "postgres failed to start; logfile:"
    cat /home/agent/pgdata/logfile
    exit 1
  }
fi
pg_isready -q is the idempotency check: no more relying on || true to absorb "already running". pgrep -F handles a stale PID file left behind by a dead process (Postgres refuses to start if the file exists and the referenced process isn't around to claim it). pg_ctl -w start waits synchronously, so by the time control returns, psql calls will succeed. And the failure branch is loud: it prints the logfile and exit 1s.
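For readers without pgrep, the stale-PID check can be approximated with kill -0. This is a swapped-in technique, not what setup.sh uses, and clear_stale_pidfile is a hypothetical helper name. It assumes the PID sits on the first line of the file, as it does in postmaster.pid:

```shell
#!/usr/bin/env bash
# clear_stale_pidfile is a hypothetical helper, not part of Postgres or setup.sh.
# It removes a PID file only when the PID on its first line names no live process.
# Caveat: kill -0 also fails for a live process owned by another user, so this
# sketch assumes the service runs as the current user.
clear_stale_pidfile() {
  local pidfile="$1" pid
  [ -f "$pidfile" ] || return 0            # nothing to clean up
  pid="$(head -n 1 "$pidfile")"            # PID is expected on line 1
  if [[ "$pid" =~ ^[0-9]+$ ]] && kill -0 "$pid" 2>/dev/null; then
    return 0                               # process is alive: leave the file alone
  fi
  rm -f "$pidfile"
}
```

Called as clear_stale_pidfile /home/agent/pgdata/postmaster.pid, it would play the same role as the pgrep -F branch above: a file pointing at a live process is left alone, and anything else is removed.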
Bug 2: /var/run/postgresql didn't exist
Once pg_ctl actually surfaced its logfile, the real failure finally showed up:
FATAL: could not create lock file "/var/run/postgresql/.s.PGSQL.5432.lock": No such file or directory
/var/run is tmpfs — its contents are wiped on every reboot. The Dockerfile creates /run/postgresql at build time with the right ownership, but that lives in the image's filesystem layer. The running container mounts a fresh tmpfs over /var/run and whatever the Dockerfile put there is gone the moment the container starts.
The previous line was defending against the wrong failure:
sudo chown agent:agent /var/run/postgresql 2>/dev/null || true
It assumed the directory existed and might have the wrong ownership. On a fresh container the directory didn't exist at all, chown failed, and the swallowers absorbed it. One line fixed it:
sudo mkdir -p /var/run/postgresql
sudo chown agent:agent /var/run/postgresql
Bug 3: the logfile was cumulative, so diagnostics were buried
When I started dumping the logfile on pg_ctl failure, the first attempt tailed the last 50 lines — and those lines turned out to be mostly stale noise from sessions days prior. The actual FATAL: could not create lock file line was at the very bottom, behind a wall of unrelated history. A logfile that accumulates across sessions is strictly worse than one scoped to the current session when what you want is "show me why startup just failed." Fixed by truncating before start:
: > /home/agent/pgdata/logfile
Bug 4: unsetting ANTHROPIC_API_KEY inside setup.sh did nothing useful
After the hoist, the first iteration appeared to hang after "server started" with no progress. The Claude CLI was picking up ANTHROPIC_API_KEY from the environment and routing every request through the API instead of the subscription — metering credits and, for that account, approaching its cap.
The existing guard was in setup.sh:
[ -n "${ANTHROPIC_API_KEY:-}" ] && unset ANTHROPIC_API_KEY
Correct in isolation, wrong file. setup.sh is invoked as a child process (./ralph/setup.sh, not source), so unset mutates only that child process's environment. The afk.sh parent still has the key, and every claude call inherits it. The old shape appeared to work because we simply hadn't stress-tested it; this run was the first time the account hit its API cap, so the first time the difference between API auth and subscription auth became observable.
The fix moves the unset to afk.sh itself, before the loop. setup.sh never invokes claude, so the line doesn't belong there — putting it in both files would be two statements of intent instead of one, and would invite the same confusion the next time someone edits one without the other.
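The parent/child behavior can be checked in a few lines. DEMO_KEY is a hypothetical variable standing in for ANTHROPIC_API_KEY, so the sketch is safe to run anywhere:

```shell
#!/usr/bin/env bash
# DEMO_KEY is a made-up stand-in for ANTHROPIC_API_KEY (assumption for the demo).
export DEMO_KEY="secret"

# A child process gets a copy of the environment; its unset dies with it.
bash -c 'unset DEMO_KEY'
after_child="${DEMO_KEY:-<unset>}"

# Unsetting in the current shell is what later children actually inherit.
unset DEMO_KEY
after_parent="$(bash -c 'echo "${DEMO_KEY:-<unset>}"')"

echo "after child unset:  $after_child"
echo "after parent unset: $after_parent"
```

The first line still shows the value and the second shows it gone, which is why the unset has to live in the script that launches claude.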
Bug 5: make install-pnpm ran every time, even when it didn't need to
On my laptop pnpm install runs on the host (macOS), so node_modules ends up with Darwin-arm64 native binaries — esbuild, playwright, swc, sharp. Inside the Linux sandbox those binaries are the wrong ABI. The initial, conservative fix was "always reinstall at setup time" — correct, but a full pnpm install burns 30–60 seconds even when the sandbox already has a valid Linux dependency tree from a previous session.
I want the install to run on the first sandbox entry after a host-side install (or after switching branches with incompatible lockfiles) and skip otherwise. The trick is choosing a check that distinguishes "node_modules works here" from "node_modules is wrong-arch" without parsing pnpm's lock state. Runtime sentinel:
if ! pnpm exec vite --version >/dev/null 2>&1; then
make install-pnpm
fi
vite is a direct dep, and vite pulls in esbuild, which ships a platform-specific native binary. Running vite --version exercises the esbuild require chain. If node_modules is missing, partially installed, or has Darwin binaries on Linux, this fails almost immediately. If it succeeds, the install is demonstrably usable on this platform, and we skip. The check costs ~100–300 ms on every setup run and saves the 30–60 second reinstall whenever the install is already valid.
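The same "cheap runtime check before an expensive install" shape generalizes to other dependency trees. A sketch, where install_unless_usable is a hypothetical helper name and the commands in the demo calls are placeholders:

```shell
#!/usr/bin/env bash
# install_unless_usable is a hypothetical helper: run the cheap check first,
# and only pay for the expensive install when the check fails.
install_unless_usable() {
  local check="$1" install="$2"
  if bash -c "$check" >/dev/null 2>&1; then
    echo "skip"
    return 0
  fi
  echo "install"
  bash -c "$install"
}

# In setup.sh the call might look like this (assumption, mirroring the commands above):
#   install_unless_usable 'pnpm exec vite --version' 'make install-pnpm'
install_unless_usable 'true' 'echo should not run'
install_unless_usable 'false' 'echo installing'
```

Suppressing the check's output is deliberate here: a failing check is the expected signal that triggers the install, and the install itself still runs with its output and exit status intact.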
The general lessons
Three themes ran through all five bugs:
- 2>/dev/null and || true are debt, not defenses. Every error-swallower was hiding a real failure. The right idiom for "it's ok if this fails" is to name the specific failure mode and handle it explicitly (pg_isready check, stale-PID cleanup, mkdir -p), not to blanket-suppress.
- Loops hide bugs that single-shot scripts don't. A setup script failing loudly is better than a loop that retries into a working state. Retry masked four different root causes here; straight-line execution surfaced them all in a single session.
- Environment mutations only affect their own process. unset in a child script cannot mutate the parent. If the claude call needs a clean env, the unset belongs in the file that invokes claude.
Useful commands
| Command | Purpose |
|---|---|
| sbx run shell --template guetteman/claude-php-8.4 | Create the sandbox the first time (pulls the template) |
| sbx run shell | Re-enter the existing sandbox (all subsequent sessions) |
| sbx ls | List running sandboxes |
| sbx rm <id> | Remove a sandbox |
| docker build -t guetteman/claude-php-8.4 ralph/ && docker push guetteman/claude-php-8.4 | Rebuild and publish the template |
| ./ralph/afk.sh 20 (inside sandbox) | Run the autonomous loop for 20 iterations |
Limitations worth knowing before you start
- Cannot execute host binaries (Herd's PHP, herd CLI, etc.).
- No raw TCP/UDP/ICMP; only HTTP/HTTPS passes through the network proxy. Host databases and Herd's *.test domains are unreachable.
- sbx doesn't accept --env, --network, or arbitrary mount paths.
- sbx only pulls from registries; local-only images aren't usable.
- sbx run ... "<prompt>" doesn't echo the prompt back to the terminal. Use sbx run shell and drive the loop from inside.
- PHP 8.5 packages from the ondrej PPA have dependency conflicts with the base image (Ubuntu 25.10); PHP 8.4 from default repos works cleanly.
- Docker Desktop's docker sandbox caps memory at 4 GB, which is not enough for Claude + Postgres + ParaTest + Playwright.
What I learned the hard way
A short change log of the fixes this setup has accumulated — each one came from something that actually broke:
- Switched docker sandbox → sbx to escape the 4 GB memory cap.
- Published the image to Docker Hub (guetteman/claude-php-8.4) because sbx only pulls from registries.
- Moved the afk loop inside the sandbox shell (sbx run shell ... then ./ralph/afk.sh) because sbx's non-interactive mode doesn't show the prompt.
- Rewrote afk.sh's jq filter to consume stream_event partial messages and print tool-call summaries ([Bash] ..., [Read] ...) instead of raw JSON.
- Added setup.sh to replace the inline bootstrap commands that used to be arguments to docker sandbox run. It starts Postgres, fixes /run/postgresql ownership, reencodes template1 to UTF8 for ParaTest, and creates the app databases.
- Replaced unconditional make pre-push with scoped feedback loops in ralph/prompt.md. The agent now picks the subset of make targets that match the diff instead of running the full CI gate every iteration, which was the single biggest speedup to the loop.
- Added first-run bootstrap: install the Claude CLI + claude login, gh auth login, and unset ANTHROPIC_API_KEY so the CLI uses the Claude subscription instead of API credits.
- Added PCOV to the Dockerfile for Pest coverage.
- Added pnpm and Playwright + Chromium deps to the Dockerfile for Pest 4 browser tests.
- Fixed phpunit.xml to stop overriding .env.testing with SQLite :memory:, which was defeating ParaTest.
- Hoisted setup.sh out of the iteration loop so it runs once per session instead of once per iteration. See Step 7 for the five latent bugs this exposed.
- Moved unset ANTHROPIC_API_KEY from setup.sh to afk.sh: the unset in a child process doesn't mutate the parent's environment, so it has to live in the file that invokes claude.
- Replaced blanket error-swallowing (2>/dev/null || true) in setup.sh with explicit state checks: pg_isready for idempotency, stale postmaster.pid cleanup via pgrep -F, pg_ctl -w for synchronous readiness, logfile truncation before start, and a loud failure branch that cats the logfile and exit 1s.
- Added sudo mkdir -p /var/run/postgresql before the chown, because /var/run is tmpfs and the directory the Dockerfile created doesn't survive a container restart. This was the actual root cause hiding behind weeks of intermittent "Connection refused" errors.
- Skipped make install-pnpm when pnpm exec vite --version succeeds, using vite's esbuild native-binary dependency as a cheap sentinel for "are these node_modules valid on this platform." Saves 30–60 seconds per sandbox entry when the Linux install from a previous session is intact.
References
- Ralph Wiggum as a 'software engineer' — Geoffrey Huntley's original post coining the Ralph loop.
- Claude Code for Real Engineers — Matt Pocock's cohort, where I learned to apply the Ralph loop inside a sandbox.
- Sandcastle — Matt's TypeScript library that packages the Ralph-loop-in-a-sandbox pattern as a reusable API.
- Docker Sandbox docs
- Docker Blog: Running agents in sandboxes