Scheduler Flow (Single Source of Truth)

Use this document for all scheduler runs.

Canonical artifact paths

All daily/weekly prompt files must reference run artifacts using these canonical directories:

src/context/CONTEXT_<timestamp>.md
src/todo/TODO_<timestamp>.md
src/decisions/DECISIONS_<timestamp>.md
src/test_logs/TEST_LOG_<timestamp>.md

Prompt authors: do not use legacy unprefixed paths (context/, todo/, decisions/, test_logs/).

Shared Agent Run Contract (Required for All Spawned Agents)

Every agent prompt invoked by the schedulers (daily/weekly) MUST enforce this contract:

Read baseline policy files before implementation:
- torch/TORCH.md
- KNOWN_ISSUES.md
- Canonical path note: active issues are tracked in root KNOWN_ISSUES.md (not docs/KNOWN_ISSUES.md)
- docs/agent-handoffs/README.md
- Recent notes in docs/agent-handoffs/learnings/ and docs/agent-handoffs/incidents/
Update run artifacts in src/context/, src/todo/, src/decisions/, and src/test_logs/ during the run, or explicitly document why each artifact update is not needed for that run.
Capture reusable failures and unresolved issues:
- Record reusable failures in docs/agent-handoffs/incidents/
- Record active unresolved reproducible items in KNOWN_ISSUES.md
Execute memory retrieval before implementation begins:
- Run configured memory retrieval workflow before prompt execution (for example via scheduler.memoryPolicyByCadence.<cadence>.retrieveCommand)
- Retrieval command MUST call real memory services (src/services/memory/index.js#getRelevantMemories with ingestEvents seeding) and MUST emit deterministic marker MEMORY_RETRIEVED.
- Retrieval command MUST write cadence-scoped evidence artifacts:
  - .scheduler-memory/retrieve-<cadence>.ok
  - .scheduler-memory/retrieve-<cadence>.json containing operation inputs/outputs (agentId, query, seeded event count, ingested count, retrieved count).
- Agent Action: Review .scheduler-memory/latest/<cadence>/memories.md for relevant context.
Store memory after implementation and before completion publish:
- Agent Action: Write any new insights, learnings, or patterns to the file specified by $SCHEDULER_MEMORY_FILE (defaulting to memory-updates/<timestamp>__<agent>.md).
- Run configured memory storage workflow after prompt execution (for example via scheduler.memoryPolicyByCadence.<cadence>.storeCommand)
- Storage command MUST call real memory services (src/services/memory/index.js#ingestEvents, which uses ingestor/summarizer pipeline) and MUST emit deterministic marker MEMORY_STORED.
- Storage command MUST ingest content from the file at $SCHEDULER_MEMORY_FILE (or fallback to memory-update.md if present).
- Storage command MUST write cadence-scoped evidence artifacts:
  - .scheduler-memory/store-<cadence>.ok
  - .scheduler-memory/store-<cadence>.json containing operation inputs/outputs (agentId, input event count, stored count, generated summaries).
Scheduler-owned completion/logging is mandatory:
- Spawned agents MUST NOT run lock:complete.
- Spawned agents MUST NOT write final task-logs/<cadence>/<timestamp>__<agent-name>__completed.md or __failed.md files.
- Scheduler performs completion publish and writes final success/failure task logs after its own validation gates.

Numbered MUST Procedure

Set cadence variables before any command:
- cadence = daily or weekly
- log_dir = task-logs/<cadence>/
- prompt_dir = src/prompts/<cadence>/
Note: branch naming (for example agents/<cadence>/) is orchestration-level behavior and is not used by run-scheduler-cycle.mjs.
Run preflight to build the exclusion set:
```
if daily: `npm run lock:check:daily -- --json --quiet`; if weekly: `npm run lock:check:weekly -- --json --quiet`
```
Canonical exclusion rule:
- Use excluded from the npm run lock:check:<cadence> JSON output.
- If excluded is unavailable, fallback to the union of locked, paused, and completed from that same JSON payload.
Goose Desktop note: npm run lock:check:<cadence> can emit large hermit wrapper logs. Use --json --quiet (as documented above). If the command still fails due to Goose hermit issues, apply the PATH workaround in KNOWN_ISSUES.md before rerunning.
Read policy file(s) once before the run loop. This step is conditional: if torch/TORCH.md is missing, continue without failing.
```
test -f torch/TORCH.md && cat torch/TORCH.md || echo "No torch/TORCH.md found; continuing"
```
Bootstrap log directories before listing files:
```
mkdir -p <log_dir>
```
When lock health preflight is enabled (scheduler.lockHealthPreflight: true or env SCHEDULER_LOCK_HEALTH_PREFLIGHT=1), verify relay/query health before selecting an agent or calling lock:lock:
```
npm run lock:health -- --cadence <cadence>
```
- Escape hatch: set SCHEDULER_SKIP_LOCK_HEALTH_PREFLIGHT=1 to skip this check for local/offline workflows.
- If preflight exits non-zero because every relay is unhealthy, write _deferred.md with reason All relays unhealthy preflight, include incident_signal_id, set failure_category: lock_backend_error, state prompt not executed, and stop before lock acquisition.
- For other non-zero preflight failures, write _failed.md with reason Lock backend unavailable preflight, set failure_category: lock_backend_error, state prompt not executed, and include:
  - relay_list
  - preflight_failure_category
  - preflight_stderr_excerpt
  - preflight_stdout_excerpt
- Always include any preflight alert payload (preflight_alerts) in the scheduler metadata for operational triage.
Find latest cadence log file, derive the previous agent, then choose the next roster agent not in exclusion set:
```
ls -1 <log_dir> | sort | tail -n 1
```
Selection algorithm (MUST be followed exactly):
- Roster source: torch/roster.json and the key matching <cadence>.
- Let roster be that ordered array and excluded be the set from step 2's canonical exclusion rule.
- Let latest_file be the lexicographically last filename in <log_dir>.
- Determine previous_agent from latest_file using this precedence:
  1. Parse YAML frontmatter from <log_dir>/<latest_file> and use key agent when present and non-empty.
  2. Otherwise parse filename convention <timestamp>__<agent-name>__<status>.md and take <agent-name>.
- If no valid latest_file exists, or parsing fails, or previous_agent is not in roster, treat as first run fallback.
- First run fallback:
  - Read scheduler.firstPromptByCadence.<cadence> from torch-config.json if present.
  - If that agent exists in roster, set start_index = index(configured_agent).
  - Otherwise set start_index = 0.
- Otherwise: start_index = (index(previous_agent in roster) + 1) mod len(roster).
- Round-robin scan:
  - Iterate offsets 0..len(roster)-1.
  - Candidate index: (start_index + offset) mod len(roster) (wrap-around required).
  - Choose the first candidate whose agent is not in excluded.
- If no candidate is eligible, execute step 7.
Worked examples:
- Daily example
  - roster.daily = [audit-agent, ci-health-agent, const-refactor-agent, ...]
  - latest_file = 2026-02-13T00-10-00Z__ci-health-agent__completed.md
  - excluded = {const-refactor-agent, docs-agent}
  - previous_agent = ci-health-agent, so start_index points to const-refactor-agent.
  - const-refactor-agent is excluded; skip to content-audit-agent.
  - Selection result: content-audit-agent.
- Weekly example
  - roster.weekly = [bug-reproducer-agent, changelog-agent, ..., weekly-synthesis-agent]
  - latest_file = 2026-02-09T00-00-00Z__weekly-synthesis-agent__completed.md
  - excluded = {}
  - previous_agent = weekly-synthesis-agent (last roster entry), so start_index = 0 by wrap-around.
  - First candidate is bug-reproducer-agent and is eligible.
  - Selection result: bug-reproducer-agent.
If every roster agent is excluded, write a _failed.md log with: All roster tasks currently claimed by other agents and stop.
Claim selected agent:
```
AGENT_PLATFORM=<platform> \
npm run lock:lock -- --agent <agent-name> --cadence <cadence>
```
- Exit 0: lock acquired, continue.
- Exit 3: race lost/already locked, return to step 2.
- Exit 2: lock backend error.
  - If scheduler.strict_lock is false, defer the run while both budget constraints are still satisfied:
    - degraded_lock_retry_window (ms) since first backend failure has not elapsed.
    - max_deferrals has not been exceeded.
    - Record deferral metadata in scheduler run state (attempt_count, first_failure_timestamp, backend_category, and preserved idempotency key).
  - Otherwise write _failed.md with reason Lock backend error, and include failure metadata fields:
  - backend_category (classified backend failure category)
  - lock_command (raw lock command for retry)
  - lock_stderr_excerpt (redacted stderr snippet)
  - lock_stdout_excerpt (redacted stdout snippet)
  - Include failure_class: backend_unavailable and failure_category: lock_backend_error for both deferred and failed backend-unavailable lock outcomes.
- Include recommended auto-remediation text in detail: retry window, npm run lock:health -- --cadence <cadence>, and incident runbook link.
- Keep generic reason text for compatibility, but append actionable retry guidance in detail using the command from lock_command.
Execute <prompt_dir>/<prompt-file> end-to-end via configured handoff command.
- Scheduler automation runs scheduler.handoffCommandByCadence.<cadence> with environment variables for cadence/agent/prompt path.
- If no handoff command is configured for the cadence, write _failed.md and stop.
- Prompt file read/parse failures should emit failure_category: prompt_parse_error.
- Prompt schema/contract failures should emit failure_category: prompt_schema_error.
- Command/handoff/validation runtime failures should emit failure_category: execution_error.
Confirm memory contract completion:

Memory retrieval evidence must exist for this run (output marker and/or artifact file).
Memory storage evidence must exist for this run (output marker and/or artifact file).
Enforced daily/weekly commands in torch-config.json run node --input-type=module snippets that call src/services/memory/index.js APIs directly:
- Retrieval path: ingestEvents(...) seed + getRelevantMemories(...) retrieval.
- Storage path: ingestEvents(...) (ingestor + summarizer path).
Input contract for retrieval evidence JSON:
- Required keys: cadence, operation: "retrieve", servicePath, inputs, outputs, status: "ok".
- inputs must include agentId, query, and seeded events count.
- outputs must include ingestedCount and retrievedCount.
Input contract for storage evidence JSON:
- Required keys: cadence, operation: "store", servicePath, inputs, outputs, status: "ok".
- inputs must include agentId and input events count.
- outputs must include storedCount and generated summaries.
Failure semantics for required mode:
- If retrieval/store command exits non-zero, scheduler writes _failed.md and stops.
- If command succeeds but markers/artifacts are missing, scheduler treats this as missing memory evidence.
Prompt authors MUST keep any command changes aligned with configured markers/artifacts so scheduler evidence checks remain satisfiable.
If scheduler.memoryPolicyByCadence.<cadence>.mode = required, missing evidence is a hard failure.
If mode is optional, log warning context and continue.

Verify required run artifacts for the current run window.
- Scheduler runs node scripts/agent/verify-run-artifacts.mjs --since <run-start-iso> --check-failure-notes.
- If artifact verification exits non-zero: write _failed.md and stop.
Run repository checks (for example: npm run lint).
- If any validation command exits non-zero: fail the run immediately, write _failed.md with the failing command and reason, and stop.
- step 12 MUST NOT be executed (lock:complete is forbidden until validation passes).
- In current numbering: when step 12 fails, step 13 MUST NOT run.
Publish completion before writing final success log:
```
AGENT_PLATFORM=<platform> \
npm run lock:complete -- --agent <agent-name> --cadence <cadence>
```
(Equivalent invocation is allowed: torch-lock complete --agent <agent-name> --cadence <cadence>.)
- Exit 0: completion published successfully; continue to step 14.
- Exit non-zero: fail the run, write _failed.md with a clear reason that completion publish failed and retry guidance (for example: Retry npm run lock:complete -- --agent <agent-name> --cadence <cadence> after verifying relay connectivity), then stop.
Create final task log only after step 13 succeeds (scheduler-owned):
- _completed.md MUST be created only after completion publish succeeds.
- _failed.md is required when step 11, step 12, or step 13 fails, and should include the failure reason and next retry action.
- Include platform in frontmatter using AGENT_PLATFORM (or the scheduler --platform value) for both _completed.md and _failed.md.
Commit/push behavior is delegated outside this scheduler script.
- scripts/agent/run-scheduler-cycle.mjs does not run git commit or git push.
- If commit/push is required for your workflow, perform it in the configured handoff agent command or a separate orchestration step.

Worked post-task example (MUST order):

AGENT_PLATFORM=codex npm run lock:lock -- --agent content-audit-agent --cadence daily
Execute torch/prompts/daily/content-audit-agent.md
node scripts/agent/verify-run-artifacts.mjs --since <run-start-iso> --check-failure-notes
AGENT_PLATFORM=codex npm run lock:complete -- --agent content-audit-agent --cadence daily (complete, permanent)
Write task-logs/daily/2026-02-14T10-00-00Z__content-audit-agent__completed.md

Worked validation-failure example (MUST behavior):

AGENT_PLATFORM=codex npm run lock:lock -- --agent content-audit-agent --cadence daily
Execute torch/prompts/daily/content-audit-agent.md
node scripts/agent/verify-run-artifacts.mjs --since <run-start-iso> --check-failure-notes passes
npm run lint exits non-zero (or npm test exits non-zero)
Write task-logs/daily/2026-02-14T10-00-00Z__content-audit-agent__failed.md with the failing command and reason
Stop the run without calling npm run lock:complete -- --agent content-audit-agent --cadence daily

15 KiB Raw Permalink Blame History

Scheduler Flow (Single Source of Truth)

Canonical artifact paths

Shared Agent Run Contract (Required for All Spawned Agents)

Numbered MUST Procedure

15 KiB

Raw Permalink Blame History