# Scheduler Flow (Single Source of Truth)

Use this document for all scheduler runs.
## Canonical artifact paths

All daily/weekly prompt files must reference run artifacts using these canonical directories:

- `src/context/CONTEXT_<timestamp>.md`
- `src/todo/TODO_<timestamp>.md`
- `src/decisions/DECISIONS_<timestamp>.md`
- `src/test_logs/TEST_LOG_<timestamp>.md`

Prompt authors: do not use legacy unprefixed paths (`context/`, `todo/`, `decisions/`, `test_logs/`).
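As a quick illustration of the convention, a hypothetical helper (not part of the repo) could derive all four canonical paths from one shared run timestamp:

```javascript
// Hypothetical helper showing how the four canonical artifact paths
// share a single run timestamp. Not part of the repo's actual code.
function canonicalArtifactPaths(timestamp) {
  return {
    context: `src/context/CONTEXT_${timestamp}.md`,
    todo: `src/todo/TODO_${timestamp}.md`,
    decisions: `src/decisions/DECISIONS_${timestamp}.md`,
    testLog: `src/test_logs/TEST_LOG_${timestamp}.md`,
  };
}
```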
## Shared Agent Run Contract (Required for All Spawned Agents)

Every agent prompt invoked by the schedulers (daily/weekly) MUST enforce this contract:
- Read baseline policy files before implementation:
  - `torch/TORCH.md`
  - `KNOWN_ISSUES.md` (canonical path note: active issues are tracked in the root `KNOWN_ISSUES.md`, not `docs/KNOWN_ISSUES.md`)
  - `docs/agent-handoffs/README.md`
  - Recent notes in `docs/agent-handoffs/learnings/` and `docs/agent-handoffs/incidents/`
- Update run artifacts in `src/context/`, `src/todo/`, `src/decisions/`, and `src/test_logs/` during the run, or explicitly document why each artifact update is not needed for that run.
- Capture reusable failures and unresolved issues:
  - Record reusable failures in `docs/agent-handoffs/incidents/`
  - Record active unresolved reproducible items in `KNOWN_ISSUES.md`
- Execute memory retrieval before implementation begins:
  - Run the configured memory retrieval workflow before prompt execution (for example via `scheduler.memoryPolicyByCadence.<cadence>.retrieveCommand`).
  - The retrieval command MUST call real memory services (`src/services/memory/index.js#getRelevantMemories` with `ingestEvents` seeding) and MUST emit the deterministic marker `MEMORY_RETRIEVED`.
  - The retrieval command MUST write cadence-scoped evidence artifacts:
    - `.scheduler-memory/retrieve-<cadence>.ok`
    - `.scheduler-memory/retrieve-<cadence>.json` containing operation inputs/outputs (`agentId`, `query`, seeded event count, ingested count, retrieved count).
  - Agent Action: Review `.scheduler-memory/latest/<cadence>/memories.md` for relevant context.
- Store memory after implementation and before completion publish:
  - Agent Action: Write any new insights, learnings, or patterns to the file specified by `$SCHEDULER_MEMORY_FILE` (defaulting to `memory-updates/<timestamp>__<agent>.md`).
  - Run the configured memory storage workflow after prompt execution (for example via `scheduler.memoryPolicyByCadence.<cadence>.storeCommand`).
  - The storage command MUST call real memory services (`src/services/memory/index.js#ingestEvents`, which uses the ingestor/summarizer pipeline) and MUST emit the deterministic marker `MEMORY_STORED`.
  - The storage command MUST ingest content from the file at `$SCHEDULER_MEMORY_FILE` (or fall back to `memory-update.md` if present).
  - The storage command MUST write cadence-scoped evidence artifacts:
    - `.scheduler-memory/store-<cadence>.ok`
    - `.scheduler-memory/store-<cadence>.json` containing operation inputs/outputs (`agentId`, input event count, stored count, generated summaries).
- Scheduler-owned completion/logging is mandatory:
  - Spawned agents MUST NOT run `lock:complete`.
  - Spawned agents MUST NOT write final `task-logs/<cadence>/<timestamp>__<agent-name>__completed.md` or `__failed.md` files.
  - The scheduler performs the completion publish and writes final success/failure task logs after its own validation gates.
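The retrieval leg of the memory contract above can be sketched as follows. This is a minimal sketch, not the repo's actual command: the two service functions are injected as parameters, their real signatures in `src/services/memory/index.js` may differ, and the returned payload follows the evidence JSON contract described later in this document.

```javascript
// Sketch of a memory-retrieval step. `ingestEvents` and `getRelevantMemories`
// are the documented service APIs, injected here with assumed signatures;
// everything else is illustrative plumbing.
function runRetrieval({ ingestEvents, getRelevantMemories }, { cadence, agentId, query, events }) {
  const ingestedCount = ingestEvents(agentId, events);  // seed events first
  const memories = getRelevantMemories(agentId, query); // then retrieve
  console.log('MEMORY_RETRIEVED');                      // deterministic marker required by the contract
  return {                                              // body of retrieve-<cadence>.json
    cadence,
    operation: 'retrieve',
    servicePath: 'src/services/memory/index.js',
    inputs: { agentId, query, events: events.length },
    outputs: { ingestedCount, retrievedCount: memories.length },
    status: 'ok',
  };
}
```

A storage command would follow the same shape with `operation: 'store'`, the `MEMORY_STORED` marker, and `storedCount`/`summaries` outputs.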
## Numbered MUST Procedure
1. Set cadence variables before any command:
   - `cadence=daily` or `cadence=weekly`
   - `log_dir=task-logs/<cadence>/`
   - `prompt_dir=src/prompts/<cadence>/`

   Note: branch naming (for example `agents/<cadence>/`) is orchestration-level behavior and is not used by `run-scheduler-cycle.mjs`.
2. Run preflight to build the exclusion set:
   - If daily: `npm run lock:check:daily -- --json --quiet`; if weekly: `npm run lock:check:weekly -- --json --quiet`
   - Canonical exclusion rule:
     - Use `excluded` from the `npm run lock:check:<cadence>` JSON output.
     - If `excluded` is unavailable, fall back to the union of `locked`, `paused`, and `completed` from that same JSON payload.
   - Goose Desktop note: `npm run lock:check:<cadence>` can emit large hermit wrapper logs. Use `--json --quiet` (as documented above). If the command still fails due to Goose hermit issues, apply the PATH workaround in `KNOWN_ISSUES.md` before rerunning.
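The canonical exclusion rule can be sketched as a small function over the parsed `lock:check` JSON payload (field names come from the rule above; the function itself is illustrative, not the repo's implementation):

```javascript
// Sketch of the canonical exclusion rule: prefer the `excluded` array from
// the lock:check JSON output, otherwise union locked/paused/completed.
function exclusionSet(lockCheckJson) {
  if (Array.isArray(lockCheckJson.excluded)) {
    return new Set(lockCheckJson.excluded);
  }
  // Fallback: union of the three status arrays from the same payload.
  return new Set([
    ...(lockCheckJson.locked ?? []),
    ...(lockCheckJson.paused ?? []),
    ...(lockCheckJson.completed ?? []),
  ]);
}
```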
3. Read policy file(s) once before the run loop. This step is conditional: if `torch/TORCH.md` is missing, continue without failing:
   `test -f torch/TORCH.md && cat torch/TORCH.md || echo "No torch/TORCH.md found; continuing"`
4. Bootstrap log directories before listing files: `mkdir -p <log_dir>`
5. When the lock health preflight is enabled (`scheduler.lockHealthPreflight: true` or env `SCHEDULER_LOCK_HEALTH_PREFLIGHT=1`), verify relay/query health before selecting an agent or calling `lock:lock`: `npm run lock:health -- --cadence <cadence>`
   - Escape hatch: set `SCHEDULER_SKIP_LOCK_HEALTH_PREFLIGHT=1` to skip this check for local/offline workflows.
   - If the preflight exits non-zero because every relay is unhealthy, write `_deferred.md` with reason `All relays unhealthy preflight`, include `incident_signal_id`, set `failure_category: lock_backend_error`, state `prompt not executed`, and stop before lock acquisition.
   - For other non-zero preflight failures, write `_failed.md` with reason `Lock backend unavailable preflight`, set `failure_category: lock_backend_error`, state `prompt not executed`, and include:
     - `relay_list`
     - `preflight_failure_category`
     - `preflight_stderr_excerpt`
     - `preflight_stdout_excerpt`
   - Always include any preflight alert payload (`preflight_alerts`) in the scheduler metadata for operational triage.
6. Find the latest cadence log file, derive the previous agent, then choose the next roster agent not in the exclusion set:
   `ls -1 <log_dir> | sort | tail -n 1`

   Selection algorithm (MUST be followed exactly):
   - Roster source: `torch/roster.json` and the key matching `<cadence>`.
   - Let `roster` be that ordered array and `excluded` be the set from step 2's canonical exclusion rule.
   - Let `latest_file` be the lexicographically last filename in `<log_dir>`.
   - Determine `previous_agent` from `latest_file` using this precedence:
     - Parse YAML frontmatter from `<log_dir>/<latest_file>` and use the key `agent` when present and non-empty.
     - Otherwise parse the filename convention `<timestamp>__<agent-name>__<status>.md` and take `<agent-name>`.
   - If no valid `latest_file` exists, or parsing fails, or `previous_agent` is not in `roster`, treat it as the first-run fallback.
   - First-run fallback:
     - Read `scheduler.firstPromptByCadence.<cadence>` from `torch-config.json` if present.
     - If that agent exists in `roster`, set `start_index = index(configured_agent)`.
     - Otherwise set `start_index = 0`.
   - Otherwise: `start_index = (index(previous_agent in roster) + 1) mod len(roster)`.
   - Round-robin scan:
     - Iterate offsets `0..len(roster)-1`.
     - Candidate index: `(start_index + offset) mod len(roster)` (wrap-around required).
     - Choose the first candidate whose agent is not in `excluded`.
   - If no candidate is eligible, execute step 7.

   Worked examples:
   - Daily example:
     - `roster.daily = [audit-agent, ci-health-agent, const-refactor-agent, ...]`
     - `latest_file = 2026-02-13T00-10-00Z__ci-health-agent__completed.md`
     - `excluded = {const-refactor-agent, docs-agent}`
     - `previous_agent = ci-health-agent`, so `start_index` points to `const-refactor-agent`.
     - `const-refactor-agent` is excluded; skip to `content-audit-agent`.
     - Selection result: `content-audit-agent`.
   - Weekly example:
     - `roster.weekly = [bug-reproducer-agent, changelog-agent, ..., weekly-synthesis-agent]`
     - `latest_file = 2026-02-09T00-00-00Z__weekly-synthesis-agent__completed.md`
     - `excluded = {}`
     - `previous_agent = weekly-synthesis-agent` (the last roster entry), so `start_index = 0` by wrap-around.
     - The first candidate is `bug-reproducer-agent` and is eligible.
     - Selection result: `bug-reproducer-agent`.
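The selection algorithm and both worked examples can be sketched as one function (variable names follow the algorithm above; frontmatter/filename parsing is assumed to have already produced `previousAgent`):

```javascript
// Sketch of the round-robin selection algorithm. `previousAgent` is null on
// a first run or when parsing failed; `firstPrompt` is the optional
// scheduler.firstPromptByCadence.<cadence> value.
function selectNextAgent({ roster, excluded, previousAgent, firstPrompt }) {
  let startIndex;
  const prevIdx = previousAgent ? roster.indexOf(previousAgent) : -1;
  if (prevIdx === -1) {
    // First-run fallback: configured first prompt if present in the roster, else index 0.
    const cfgIdx = firstPrompt ? roster.indexOf(firstPrompt) : -1;
    startIndex = cfgIdx === -1 ? 0 : cfgIdx;
  } else {
    startIndex = (prevIdx + 1) % roster.length; // next agent after the previous one
  }
  for (let offset = 0; offset < roster.length; offset++) {
    const candidate = roster[(startIndex + offset) % roster.length]; // wrap-around scan
    if (!excluded.has(candidate)) return candidate; // first eligible candidate wins
  }
  return null; // no candidate eligible: step 7 applies
}
```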
7. If every roster agent is excluded, write a `_failed.md` log with `All roster tasks currently claimed by other agents` and stop.
8. Claim the selected agent:
   `AGENT_PLATFORM=<platform> npm run lock:lock -- --agent <agent-name> --cadence <cadence>`
   - Exit `0`: lock acquired; continue.
   - Exit `3`: race lost/already locked; return to step 2.
   - Exit `2`: lock backend error.
     - If `scheduler.strict_lock` is `false`, defer the run while both budget constraints are still satisfied:
       - `degraded_lock_retry_window` (ms) since the first backend failure has not elapsed.
       - `max_deferrals` has not been exceeded.
       - Record deferral metadata in scheduler run state (`attempt_count`, `first_failure_timestamp`, `backend_category`, and the preserved idempotency key).
     - Otherwise write `_failed.md` with reason `Lock backend error`, and include failure metadata fields:
       - `backend_category` (classified backend failure category)
       - `lock_command` (raw lock command for retry)
       - `lock_stderr_excerpt` (redacted stderr snippet)
       - `lock_stdout_excerpt` (redacted stdout snippet)
     - Include `failure_class: backend_unavailable` and `failure_category: lock_backend_error` for both deferred and failed backend-unavailable lock outcomes.
     - Include recommended auto-remediation text in `detail`: the retry window, `npm run lock:health -- --cadence <cadence>`, and the incident runbook link.
     - Keep generic reason text for compatibility, but append actionable retry guidance in `detail` using the command from `lock_command`.
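The exit-code handling for lock acquisition can be sketched as a dispatch function. This is an illustrative sketch of the decision table above, not the scheduler's actual code; the budget fields mirror the two deferral constraints:

```javascript
// Sketch of the lock:lock exit-code dispatch. Returns the action the
// scheduler should take; the budget inputs are illustrative names.
function lockOutcome(exitCode, { strictLock, msSinceFirstFailure, retryWindowMs, deferrals, maxDeferrals }) {
  if (exitCode === 0) return 'continue';        // lock acquired
  if (exitCode === 3) return 'retry-preflight'; // race lost: return to step 2
  if (exitCode === 2) {
    // Defer only in non-strict mode while both budget constraints hold.
    const withinBudget = msSinceFirstFailure < retryWindowMs && deferrals < maxDeferrals;
    if (!strictLock && withinBudget) return 'defer'; // record deferral metadata
    return 'fail';                                   // write _failed.md with lock metadata
  }
  return 'fail';
}
```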
9. Execute `<prompt_dir>/<prompt-file>` end-to-end via the configured handoff command.
   - Scheduler automation runs `scheduler.handoffCommandByCadence.<cadence>` with environment variables for cadence/agent/prompt path.
   - If no handoff command is configured for the cadence, write `_failed.md` and stop.
   - Prompt file read/parse failures should emit `failure_category: prompt_parse_error`.
   - Prompt schema/contract failures should emit `failure_category: prompt_schema_error`.
   - Command/handoff/validation runtime failures should emit `failure_category: execution_error`.
10. Confirm memory contract completion:
    - Memory retrieval evidence must exist for this run (output marker and/or artifact file).
    - Memory storage evidence must exist for this run (output marker and/or artifact file).
    - Enforced daily/weekly commands in `torch-config.json` run `node --input-type=module` snippets that call `src/services/memory/index.js` APIs directly:
      - Retrieval path: `ingestEvents(...)` seed + `getRelevantMemories(...)` retrieval.
      - Storage path: `ingestEvents(...)` (ingestor + summarizer path).
    - Input contract for retrieval evidence JSON:
      - Required keys: `cadence`, `operation: "retrieve"`, `servicePath`, `inputs`, `outputs`, `status: "ok"`.
      - `inputs` must include `agentId`, `query`, and seeded `events` count.
      - `outputs` must include `ingestedCount` and `retrievedCount`.
    - Input contract for storage evidence JSON:
      - Required keys: `cadence`, `operation: "store"`, `servicePath`, `inputs`, `outputs`, `status: "ok"`.
      - `inputs` must include `agentId` and input `events` count.
      - `outputs` must include `storedCount` and generated `summaries`.
    - Failure semantics for required mode:
      - If the retrieval/store command exits non-zero, the scheduler writes `_failed.md` and stops.
      - If the command succeeds but markers/artifacts are missing, the scheduler treats this as missing memory evidence.
    - Prompt authors MUST keep any command changes aligned with the configured markers/artifacts so scheduler evidence checks remain satisfiable.
    - If `scheduler.memoryPolicyByCadence.<cadence>.mode = required`, missing evidence is a hard failure.
    - If mode is `optional`, log warning context and continue.
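The two evidence-JSON contracts can be sketched as a single check. The key names come from the contracts above; the validation logic itself is illustrative and is not the scheduler's actual evidence check:

```javascript
// Sketch of a validator for the retrieval/storage evidence JSON contracts.
// `operation` is 'retrieve' or 'store'.
function checkEvidence(json, operation) {
  const required = ['cadence', 'operation', 'servicePath', 'inputs', 'outputs', 'status'];
  const missing = required.filter((k) => !(k in json));
  if (missing.length || json.operation !== operation || json.status !== 'ok') return false;
  if (operation === 'retrieve') {
    return ['agentId', 'query', 'events'].every((k) => k in json.inputs)
      && ['ingestedCount', 'retrievedCount'].every((k) => k in json.outputs);
  }
  return ['agentId', 'events'].every((k) => k in json.inputs)
    && ['storedCount', 'summaries'].every((k) => k in json.outputs);
}
```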
11. Verify required run artifacts for the current run window.
    - The scheduler runs `node scripts/agent/verify-run-artifacts.mjs --since <run-start-iso> --check-failure-notes`.
    - If artifact verification exits non-zero: write `_failed.md` and stop.
12. Run repository checks (for example: `npm run lint`).
    - If any validation command exits non-zero: fail the run immediately, write `_failed.md` with the failing command and reason, and stop.
    - When this step fails, step 13 MUST NOT run (`lock:complete` is forbidden until validation passes).
13. Publish completion before writing the final success log:
    `AGENT_PLATFORM=<platform> npm run lock:complete -- --agent <agent-name> --cadence <cadence>`
    (An equivalent invocation is allowed: `torch-lock complete --agent <agent-name> --cadence <cadence>`.)
    - Exit `0`: completion published successfully; continue to step 14.
    - Exit non-zero: fail the run, write `_failed.md` with a clear reason that the completion publish failed and retry guidance (for example: `Retry npm run lock:complete -- --agent <agent-name> --cadence <cadence> after verifying relay connectivity`), then stop.
14. Create the final task log only after step 13 succeeds (scheduler-owned):
    - `_completed.md` MUST be created only after the completion publish succeeds.
    - `_failed.md` is required when step 11, step 12, or step 13 fails, and should include the failure reason and next retry action.
    - Include `platform` in frontmatter using `AGENT_PLATFORM` (or the scheduler `--platform` value) for both `_completed.md` and `_failed.md`.
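The task-log naming convention and frontmatter requirement can be sketched as two small helpers. These are illustrative, not the scheduler's actual code; the `status` frontmatter field beyond `agent`/`platform` is an assumption:

```javascript
// Sketch of the task-log filename convention <timestamp>__<agent-name>__<status>.md.
function taskLogName(timestamp, agent, status) {
  return `${timestamp}__${agent}__${status}.md`;
}

// Sketch of YAML frontmatter for a final task log. `agent` and `platform`
// come from the contract above; `status` is an illustrative extra field.
function taskLogFrontmatter({ agent, platform, status }) {
  return ['---', `agent: ${agent}`, `platform: ${platform}`, `status: ${status}`, '---'].join('\n');
}
```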
15. Commit/push behavior is delegated outside this scheduler script.
    - `scripts/agent/run-scheduler-cycle.mjs` does not run `git commit` or `git push`.
    - If commit/push is required for your workflow, perform it in the configured handoff agent command or a separate orchestration step.
## Worked post-task example (MUST order)

1. `AGENT_PLATFORM=codex npm run lock:lock -- --agent content-audit-agent --cadence daily`
2. Execute `torch/prompts/daily/content-audit-agent.md`
3. `node scripts/agent/verify-run-artifacts.mjs --since <run-start-iso> --check-failure-notes`
4. `AGENT_PLATFORM=codex npm run lock:complete -- --agent content-audit-agent --cadence daily` (complete, permanent)
5. Write `task-logs/daily/2026-02-14T10-00-00Z__content-audit-agent__completed.md`
## Worked validation-failure example (MUST behavior)

1. `AGENT_PLATFORM=codex npm run lock:lock -- --agent content-audit-agent --cadence daily`
2. Execute `torch/prompts/daily/content-audit-agent.md`
3. `node scripts/agent/verify-run-artifacts.mjs --since <run-start-iso> --check-failure-notes` passes
4. `npm run lint` exits non-zero (or `npm test` exits non-zero)
5. Write `task-logs/daily/2026-02-14T10-00-00Z__content-audit-agent__failed.md` with the failing command and reason
6. Stop the run without calling `npm run lock:complete -- --agent content-audit-agent --cadence daily`