# Scheduler Flow (Single Source of Truth) Use this document for all scheduler runs. ## Canonical artifact paths All daily/weekly prompt files must reference run artifacts using these canonical directories: - `src/context/CONTEXT_.md` - `src/todo/TODO_.md` - `src/decisions/DECISIONS_.md` - `src/test_logs/TEST_LOG_.md` Prompt authors: do not use legacy unprefixed paths (`context/`, `todo/`, `decisions/`, `test_logs/`). ## Shared Agent Run Contract (Required for All Spawned Agents) Every agent prompt invoked by the schedulers (daily/weekly) MUST enforce this contract: 1. **Read baseline policy files before implementation**: - `torch/TORCH.md` - `KNOWN_ISSUES.md` - Canonical path note: active issues are tracked in root `KNOWN_ISSUES.md` (not `docs/KNOWN_ISSUES.md`) - `docs/agent-handoffs/README.md` - Recent notes in `docs/agent-handoffs/learnings/` and `docs/agent-handoffs/incidents/` 2. **Update run artifacts** in `src/context/`, `src/todo/`, `src/decisions/`, and `src/test_logs/` during the run, or explicitly document why each artifact update is not needed for that run. 3. **Capture reusable failures and unresolved issues**: - Record reusable failures in `docs/agent-handoffs/incidents/` - Record active unresolved reproducible items in `KNOWN_ISSUES.md` 4. **Execute memory retrieval before implementation begins**: - Run configured memory retrieval workflow before prompt execution (for example via `scheduler.memoryPolicyByCadence..retrieveCommand`) - Retrieval command MUST call real memory services (`src/services/memory/index.js#getRelevantMemories` with `ingestEvents` seeding) and MUST emit deterministic marker `MEMORY_RETRIEVED`. - Retrieval command MUST write cadence-scoped evidence artifacts: - `.scheduler-memory/retrieve-.ok` - `.scheduler-memory/retrieve-.json` containing operation inputs/outputs (`agentId`, `query`, seeded event count, ingested count, retrieved count). - **Agent Action**: Review `.scheduler-memory/latest//memories.md` for relevant context. 5. **Store memory after implementation and before completion publish**: - **Agent Action**: Write any new insights, learnings, or patterns to the file specified by `$SCHEDULER_MEMORY_FILE` (defaulting to `memory-updates/__.md`). - Run configured memory storage workflow after prompt execution (for example via `scheduler.memoryPolicyByCadence..storeCommand`) - Storage command MUST call real memory services (`src/services/memory/index.js#ingestEvents`, which uses ingestor/summarizer pipeline) and MUST emit deterministic marker `MEMORY_STORED`. - Storage command MUST ingest content from the file at `$SCHEDULER_MEMORY_FILE` (or fallback to `memory-update.md` if present). - Storage command MUST write cadence-scoped evidence artifacts: - `.scheduler-memory/store-.ok` - `.scheduler-memory/store-.json` containing operation inputs/outputs (`agentId`, input event count, stored count, generated summaries). 6. **Scheduler-owned completion/logging is mandatory**: - Spawned agents MUST NOT run `lock:complete`. - Spawned agents MUST NOT write final `task-logs//____completed.md` or `__failed.md` files. - Scheduler performs completion publish and writes final success/failure task logs after its own validation gates. ## Numbered MUST Procedure 1. Set cadence variables before any command: - `cadence` = `daily` or `weekly` - `log_dir` = `task-logs//` - `prompt_dir` = `src/prompts//` Note: branch naming (for example `agents//`) is orchestration-level behavior and is not used by `run-scheduler-cycle.mjs`. 2. Run preflight to build the exclusion set: ```bash if daily: `npm run lock:check:daily -- --json --quiet`; if weekly: `npm run lock:check:weekly -- --json --quiet` ``` Canonical exclusion rule: - Use `excluded` from the `npm run lock:check:` JSON output. - If `excluded` is unavailable, fallback to the union of `locked`, `paused`, and `completed` from that same JSON payload. Goose Desktop note: `npm run lock:check:` can emit large hermit wrapper logs. Use `--json --quiet` (as documented above). If the command still fails due to Goose hermit issues, apply the PATH workaround in `KNOWN_ISSUES.md` before rerunning. 3. Read policy file(s) once before the run loop. This step is conditional: if `torch/TORCH.md` is missing, continue without failing. ```bash test -f torch/TORCH.md && cat torch/TORCH.md || echo "No torch/TORCH.md found; continuing" ``` 4. Bootstrap log directories before listing files: ```bash mkdir -p ``` 5. When lock health preflight is enabled (`scheduler.lockHealthPreflight: true` or env `SCHEDULER_LOCK_HEALTH_PREFLIGHT=1`), verify relay/query health before selecting an agent or calling `lock:lock`: ```bash npm run lock:health -- --cadence ``` - Escape hatch: set `SCHEDULER_SKIP_LOCK_HEALTH_PREFLIGHT=1` to skip this check for local/offline workflows. - If preflight exits non-zero because every relay is unhealthy, write `_deferred.md` with reason `All relays unhealthy preflight`, include `incident_signal_id`, set `failure_category: lock_backend_error`, state `prompt not executed`, and stop before lock acquisition. - For other non-zero preflight failures, write `_failed.md` with reason `Lock backend unavailable preflight`, set `failure_category: lock_backend_error`, state `prompt not executed`, and include: - `relay_list` - `preflight_failure_category` - `preflight_stderr_excerpt` - `preflight_stdout_excerpt` - Always include any preflight alert payload (`preflight_alerts`) in the scheduler metadata for operational triage. 6. Find latest cadence log file, derive the previous agent, then choose the next roster agent not in exclusion set: ```bash ls -1 | sort | tail -n 1 ``` Selection algorithm (MUST be followed exactly): - Roster source: `torch/roster.json` and the key matching ``. - Let `roster` be that ordered array and `excluded` be the set from step 2's canonical exclusion rule. - Let `latest_file` be the lexicographically last filename in ``. - Determine `previous_agent` from `latest_file` using this precedence: 1. Parse YAML frontmatter from `/` and use key `agent` when present and non-empty. 2. Otherwise parse filename convention `____.md` and take ``. - If no valid `latest_file` exists, or parsing fails, or `previous_agent` is not in `roster`, treat as first run fallback. - First run fallback: - Read `scheduler.firstPromptByCadence.` from `torch-config.json` if present. - If that agent exists in `roster`, set `start_index = index(configured_agent)`. - Otherwise set `start_index = 0`. - Otherwise: `start_index = (index(previous_agent in roster) + 1) mod len(roster)`. - Round-robin scan: - Iterate offsets `0..len(roster)-1`. - Candidate index: `(start_index + offset) mod len(roster)` (wrap-around required). - Choose the first candidate whose agent is **not** in `excluded`. - If no candidate is eligible, execute step 7. Worked examples: - **Daily example** - `roster.daily = [audit-agent, ci-health-agent, const-refactor-agent, ...]` - `latest_file = 2026-02-13T00-10-00Z__ci-health-agent__completed.md` - `excluded = {const-refactor-agent, docs-agent}` - `previous_agent = ci-health-agent`, so `start_index` points to `const-refactor-agent`. - `const-refactor-agent` is excluded; skip to `content-audit-agent`. - **Selection result: `content-audit-agent`.** - **Weekly example** - `roster.weekly = [bug-reproducer-agent, changelog-agent, ..., weekly-synthesis-agent]` - `latest_file = 2026-02-09T00-00-00Z__weekly-synthesis-agent__completed.md` - `excluded = {}` - `previous_agent = weekly-synthesis-agent` (last roster entry), so `start_index = 0` by wrap-around. - First candidate is `bug-reproducer-agent` and is eligible. - **Selection result: `bug-reproducer-agent`.** 7. If every roster agent is excluded, write a `_failed.md` log with: `All roster tasks currently claimed by other agents` and stop. 8. Claim selected agent: ```bash AGENT_PLATFORM= \ npm run lock:lock -- --agent --cadence ``` - Exit `0`: lock acquired, continue. - Exit `3`: race lost/already locked, return to step 2. - Exit `2`: lock backend error. - If `scheduler.strict_lock` is `false`, defer the run while both budget constraints are still satisfied: - `degraded_lock_retry_window` (ms) since first backend failure has not elapsed. - `max_deferrals` has not been exceeded. - Record deferral metadata in scheduler run state (`attempt_count`, `first_failure_timestamp`, `backend_category`, and preserved idempotency key). - Otherwise write `_failed.md` with reason `Lock backend error`, and include failure metadata fields: - `backend_category` (classified backend failure category) - `lock_command` (raw lock command for retry) - `lock_stderr_excerpt` (redacted stderr snippet) - `lock_stdout_excerpt` (redacted stdout snippet) - Include `failure_class: backend_unavailable` and `failure_category: lock_backend_error` for both deferred and failed backend-unavailable lock outcomes. - Include recommended auto-remediation text in `detail`: retry window, `npm run lock:health -- --cadence `, and incident runbook link. - Keep generic reason text for compatibility, but append actionable retry guidance in `detail` using the command from `lock_command`. 9. Execute `/` end-to-end via configured handoff command. - Scheduler automation runs `scheduler.handoffCommandByCadence.` with environment variables for cadence/agent/prompt path. - If no handoff command is configured for the cadence, write `_failed.md` and stop. - Prompt file read/parse failures should emit `failure_category: prompt_parse_error`. - Prompt schema/contract failures should emit `failure_category: prompt_schema_error`. - Command/handoff/validation runtime failures should emit `failure_category: execution_error`. 10. Confirm memory contract completion: - Memory retrieval evidence must exist for this run (output marker and/or artifact file). - Memory storage evidence must exist for this run (output marker and/or artifact file). - Enforced daily/weekly commands in `torch-config.json` run `node --input-type=module` snippets that call `src/services/memory/index.js` APIs directly: - Retrieval path: `ingestEvents(...)` seed + `getRelevantMemories(...)` retrieval. - Storage path: `ingestEvents(...)` (ingestor + summarizer path). - Input contract for retrieval evidence JSON: - Required keys: `cadence`, `operation: "retrieve"`, `servicePath`, `inputs`, `outputs`, `status: "ok"`. - `inputs` must include `agentId`, `query`, and seeded `events` count. - `outputs` must include `ingestedCount` and `retrievedCount`. - Input contract for storage evidence JSON: - Required keys: `cadence`, `operation: "store"`, `servicePath`, `inputs`, `outputs`, `status: "ok"`. - `inputs` must include `agentId` and input `events` count. - `outputs` must include `storedCount` and generated `summaries`. - Failure semantics for required mode: - If retrieval/store command exits non-zero, scheduler writes `_failed.md` and stops. - If command succeeds but markers/artifacts are missing, scheduler treats this as missing memory evidence. - Prompt authors MUST keep any command changes aligned with configured markers/artifacts so scheduler evidence checks remain satisfiable. - If `scheduler.memoryPolicyByCadence..mode = required`, missing evidence is a hard failure. - If mode is `optional`, log warning context and continue. 11. Verify required run artifacts for the current run window. - Scheduler runs `node scripts/agent/verify-run-artifacts.mjs --since --check-failure-notes`. - If artifact verification exits non-zero: write `_failed.md` and stop. 12. Run repository checks (for example: `npm run lint`). - If any validation command exits non-zero: **fail the run immediately**, write `_failed.md` with the failing command and reason, and stop. - step 12 MUST NOT be executed (`lock:complete` is forbidden until validation passes). - In current numbering: when step 12 fails, step 13 MUST NOT run. 13. Publish completion before writing final success log: ```bash AGENT_PLATFORM= \ npm run lock:complete -- --agent --cadence ``` (Equivalent invocation is allowed: `torch-lock complete --agent --cadence `.) - Exit `0`: completion published successfully; continue to step 14. - Exit non-zero: **fail the run**, write `_failed.md` with a clear reason that completion publish failed and retry guidance (for example: `Retry npm run lock:complete -- --agent --cadence after verifying relay connectivity`), then stop. 14. Create final task log only after step 13 succeeds (scheduler-owned): - `_completed.md` MUST be created only after completion publish succeeds. - `_failed.md` is required when step 11, step 12, or step 13 fails, and should include the failure reason and next retry action. - Include `platform` in frontmatter using `AGENT_PLATFORM` (or the scheduler `--platform` value) for both `_completed.md` and `_failed.md`. 15. Commit/push behavior is delegated outside this scheduler script. - `scripts/agent/run-scheduler-cycle.mjs` does **not** run `git commit` or `git push`. - If commit/push is required for your workflow, perform it in the configured handoff agent command or a separate orchestration step. Worked post-task example (MUST order): 1. `AGENT_PLATFORM=codex npm run lock:lock -- --agent content-audit-agent --cadence daily` 2. Execute `torch/prompts/daily/content-audit-agent.md` 3. `node scripts/agent/verify-run-artifacts.mjs --since --check-failure-notes` 4. `AGENT_PLATFORM=codex npm run lock:complete -- --agent content-audit-agent --cadence daily` (complete, permanent) 5. Write `task-logs/daily/2026-02-14T10-00-00Z__content-audit-agent__completed.md` Worked validation-failure example (MUST behavior): 1. `AGENT_PLATFORM=codex npm run lock:lock -- --agent content-audit-agent --cadence daily` 2. Execute `torch/prompts/daily/content-audit-agent.md` 3. `node scripts/agent/verify-run-artifacts.mjs --since --check-failure-notes` passes 4. `npm run lint` exits non-zero (or `npm test` exits non-zero) 5. Write `task-logs/daily/2026-02-14T10-00-00Z__content-audit-agent__failed.md` with the failing command and reason 6. Stop the run **without** calling `npm run lock:complete -- --agent content-audit-agent --cadence daily`