thePR0M3TH3AN cc1ba691cb update
2026-02-19 22:43:56 -05:00

Scheduler Flow (Single Source of Truth)

Use this document for all scheduler runs.

Canonical artifact paths

All daily/weekly prompt files must reference run artifacts using these canonical directories:

  • src/context/CONTEXT_<timestamp>.md
  • src/todo/TODO_<timestamp>.md
  • src/decisions/DECISIONS_<timestamp>.md
  • src/test_logs/TEST_LOG_<timestamp>.md

Prompt authors: do not use legacy unprefixed paths (context/, todo/, decisions/, test_logs/).
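
As a minimal sketch of the naming rule above, the helper below builds the four canonical paths for one run and flags legacy unprefixed references. The function names and the timestamp format (borrowed from the task-log filenames later in this document) are assumptions for illustration, not part of the scheduler code.

```javascript
// Hypothetical helpers; names and timestamp format are illustrative assumptions.
const ARTIFACT_KINDS = [
  { dir: "src/context", prefix: "CONTEXT" },
  { dir: "src/todo", prefix: "TODO" },
  { dir: "src/decisions", prefix: "DECISIONS" },
  { dir: "src/test_logs", prefix: "TEST_LOG" },
];

// Returns the four canonical artifact paths for a single run timestamp.
function artifactPaths(timestamp) {
  return ARTIFACT_KINDS.map(({ dir, prefix }) => `${dir}/${prefix}_${timestamp}.md`);
}

// True when a path starts with a legacy unprefixed directory such as "todo/".
function isLegacyPath(path) {
  return /^(context|todo|decisions|test_logs)\//.test(path);
}
```

A prompt linter could call `isLegacyPath` on every artifact reference it finds and reject the prompt when any call returns true.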

Shared Agent Run Contract (Required for All Spawned Agents)

Every agent prompt invoked by the schedulers (daily/weekly) MUST enforce this contract:

  1. Read baseline policy files before implementation:

    • torch/TORCH.md
    • KNOWN_ISSUES.md
    • Canonical path note: active issues are tracked in root KNOWN_ISSUES.md (not docs/KNOWN_ISSUES.md)
    • docs/agent-handoffs/README.md
    • Recent notes in docs/agent-handoffs/learnings/ and docs/agent-handoffs/incidents/
  2. Update run artifacts in src/context/, src/todo/, src/decisions/, and src/test_logs/ during the run, or explicitly document why each artifact update is not needed for that run.

  3. Capture reusable failures and unresolved issues:

    • Record reusable failures in docs/agent-handoffs/incidents/
    • Record active unresolved reproducible items in KNOWN_ISSUES.md
  4. Execute memory retrieval before implementation begins:

    • Run configured memory retrieval workflow before prompt execution (for example via scheduler.memoryPolicyByCadence.<cadence>.retrieveCommand)
    • Retrieval command MUST call real memory services (src/services/memory/index.js#getRelevantMemories with ingestEvents seeding) and MUST emit deterministic marker MEMORY_RETRIEVED.
    • Retrieval command MUST write cadence-scoped evidence artifacts:
      • .scheduler-memory/retrieve-<cadence>.ok
      • .scheduler-memory/retrieve-<cadence>.json containing operation inputs/outputs (agentId, query, seeded event count, ingested count, retrieved count).
    • Agent Action: Review .scheduler-memory/latest/<cadence>/memories.md for relevant context.
  5. Store memory after implementation and before completion publish:

    • Agent Action: Write any new insights, learnings, or patterns to the file specified by $SCHEDULER_MEMORY_FILE (defaulting to memory-updates/<timestamp>__<agent>.md).
    • Run configured memory storage workflow after prompt execution (for example via scheduler.memoryPolicyByCadence.<cadence>.storeCommand)
    • Storage command MUST call real memory services (src/services/memory/index.js#ingestEvents, which uses ingestor/summarizer pipeline) and MUST emit deterministic marker MEMORY_STORED.
    • Storage command MUST ingest content from the file at $SCHEDULER_MEMORY_FILE (or fall back to memory-update.md if present).
    • Storage command MUST write cadence-scoped evidence artifacts:
      • .scheduler-memory/store-<cadence>.ok
      • .scheduler-memory/store-<cadence>.json containing operation inputs/outputs (agentId, input event count, stored count, generated summaries).
  6. Scheduler-owned completion/logging is mandatory:

    • Spawned agents MUST NOT run lock:complete.
    • Spawned agents MUST NOT write final task-logs/<cadence>/<timestamp>__<agent-name>__completed.md or __failed.md files.
    • Scheduler performs completion publish and writes final success/failure task logs after its own validation gates.

Numbered MUST Procedure

  1. Set cadence variables before any command:

    • cadence = daily or weekly
    • log_dir = task-logs/<cadence>/
    • prompt_dir = src/prompts/<cadence>/

    Note: branch naming (for example agents/<cadence>/) is orchestration-level behavior and is not used by run-scheduler-cycle.mjs.

  2. Run preflight to build the exclusion set:

    # daily
    npm run lock:check:daily -- --json --quiet
    # weekly
    npm run lock:check:weekly -- --json --quiet
    

    Canonical exclusion rule:

    • Use excluded from the npm run lock:check:<cadence> JSON output.
    • If excluded is unavailable, fall back to the union of locked, paused, and completed from that same JSON payload.

    Goose Desktop note: npm run lock:check:<cadence> can emit large hermit wrapper logs. Use --json --quiet (as documented above). If the command still fails due to Goose hermit issues, apply the PATH workaround in KNOWN_ISSUES.md before rerunning.
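
The canonical exclusion rule in step 2 could be sketched as follows. This assumes the parsed JSON payload exposes excluded, locked, paused, and completed as arrays of agent-name strings; the actual payload shape is not specified here, and `exclusionSet` is a hypothetical name.

```javascript
// Hypothetical helper. Assumes each key, when present, is an array of agent names.
function exclusionSet(payload) {
  // Preferred source: the precomputed "excluded" list.
  if (Array.isArray(payload.excluded)) {
    return new Set(payload.excluded);
  }
  // Fallback: union of locked, paused, and completed from the same payload.
  const union = new Set();
  for (const key of ["locked", "paused", "completed"]) {
    for (const agent of payload[key] ?? []) {
      union.add(agent);
    }
  }
  return union;
}
```

Using a `Set` keeps the later membership check in the selection scan cheap and removes duplicates when an agent appears in more than one list.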

  3. Read policy file(s) once before the run loop. This step is conditional: if torch/TORCH.md is missing, continue without failing.

    test -f torch/TORCH.md && cat torch/TORCH.md || echo "No torch/TORCH.md found; continuing"
    
  4. Bootstrap log directories before listing files:

    mkdir -p <log_dir>
    
  5. When lock health preflight is enabled (scheduler.lockHealthPreflight: true or env SCHEDULER_LOCK_HEALTH_PREFLIGHT=1), verify relay/query health before selecting an agent or calling lock:lock:

    npm run lock:health -- --cadence <cadence>
    
    • Escape hatch: set SCHEDULER_SKIP_LOCK_HEALTH_PREFLIGHT=1 to skip this check for local/offline workflows.
    • If preflight exits non-zero because every relay is unhealthy, write _deferred.md with reason All relays unhealthy preflight, include incident_signal_id, set failure_category: lock_backend_error, state prompt not executed, and stop before lock acquisition.
    • For other non-zero preflight failures, write _failed.md with reason Lock backend unavailable preflight, set failure_category: lock_backend_error, state prompt not executed, and include:
      • relay_list
      • preflight_failure_category
      • preflight_stderr_excerpt
      • preflight_stdout_excerpt
    • Always include any preflight alert payload (preflight_alerts) in the scheduler metadata for operational triage.
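
The preflight branches in step 5 can be summarized in a small sketch. How the scheduler detects that every relay is unhealthy is not described here, so the example takes that as a boolean input; both function names are hypothetical.

```javascript
// Hypothetical helpers illustrating the step 5 decision points.

// The escape hatch wins over both ways of enabling the preflight.
function shouldRunLockHealthPreflight(config, env) {
  if (env.SCHEDULER_SKIP_LOCK_HEALTH_PREFLIGHT === "1") return false;
  return (
    config?.scheduler?.lockHealthPreflight === true ||
    env.SCHEDULER_LOCK_HEALTH_PREFLIGHT === "1"
  );
}

// Maps a preflight result to the log file and reason the scheduler writes.
// allRelaysUnhealthy is assumed to be derived elsewhere from the preflight output.
function preflightOutcome(exitCode, allRelaysUnhealthy) {
  if (exitCode === 0) return { log: null };
  const failure_category = "lock_backend_error";
  if (allRelaysUnhealthy) {
    return { log: "_deferred.md", reason: "All relays unhealthy preflight", failure_category };
  }
  return { log: "_failed.md", reason: "Lock backend unavailable preflight", failure_category };
}
```

In both non-zero cases the prompt is not executed and the run stops before lock acquisition.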
  6. Find latest cadence log file, derive the previous agent, then choose the next roster agent not in exclusion set:

    ls -1 <log_dir> | sort | tail -n 1
    

    Selection algorithm (MUST be followed exactly):

    • Roster source: torch/roster.json and the key matching <cadence>.
    • Let roster be that ordered array and excluded be the set from step 2's canonical exclusion rule.
    • Let latest_file be the lexicographically last filename in <log_dir>.
    • Determine previous_agent from latest_file using this precedence:
      1. Parse YAML frontmatter from <log_dir>/<latest_file> and use key agent when present and non-empty.
      2. Otherwise parse filename convention <timestamp>__<agent-name>__<status>.md and take <agent-name>.
    • If no valid latest_file exists, or parsing fails, or previous_agent is not in roster, treat as first run fallback.
    • First run fallback:
      • Read scheduler.firstPromptByCadence.<cadence> from torch-config.json if present.
      • If that agent exists in roster, set start_index = index(configured_agent).
      • Otherwise set start_index = 0.
    • Otherwise: start_index = (index(previous_agent in roster) + 1) mod len(roster).
    • Round-robin scan:
      • Iterate offsets 0..len(roster)-1.
      • Candidate index: (start_index + offset) mod len(roster) (wrap-around required).
      • Choose the first candidate whose agent is not in excluded.
    • If no candidate is eligible, execute step 7.

    Worked examples:

    • Daily example

      • roster.daily = [audit-agent, ci-health-agent, const-refactor-agent, ...]
      • latest_file = 2026-02-13T00-10-00Z__ci-health-agent__completed.md
      • excluded = {const-refactor-agent, docs-agent}
      • previous_agent = ci-health-agent, so start_index points to const-refactor-agent.
      • const-refactor-agent is excluded; skip to content-audit-agent.
      • Selection result: content-audit-agent.
    • Weekly example

      • roster.weekly = [bug-reproducer-agent, changelog-agent, ..., weekly-synthesis-agent]
      • latest_file = 2026-02-09T00-00-00Z__weekly-synthesis-agent__completed.md
      • excluded = {}
      • previous_agent = weekly-synthesis-agent (last roster entry), so start_index = 0 by wrap-around.
      • First candidate is bug-reproducer-agent and is eligible.
      • Selection result: bug-reproducer-agent.
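
The selection algorithm in step 6 can be expressed as a short sketch. It assumes the caller has already read the roster array, the exclusion set, the latest filename, the frontmatter agent value (if any), and the configured first prompt; the function and parameter names are hypothetical and this is not the code in run-scheduler-cycle.mjs.

```javascript
// Parses <timestamp>__<agent-name>__<status>.md and returns the agent name, or null.
function agentFromFilename(filename) {
  if (typeof filename !== "string" || !filename.endsWith(".md")) return null;
  const parts = filename.slice(0, -3).split("__");
  return parts.length === 3 && parts[1] ? parts[1] : null;
}

// Hypothetical sketch of the round-robin selection. Returns null when every
// roster agent is excluded (the caller then proceeds to step 7).
function selectNextAgent({ roster, excluded, latestFile, frontmatterAgent, firstPrompt }) {
  // Precedence: frontmatter "agent" key first, then the filename convention.
  const previous = frontmatterAgent || agentFromFilename(latestFile);
  const previousIndex = previous ? roster.indexOf(previous) : -1;

  let startIndex;
  if (previousIndex === -1) {
    // First run fallback: configured first prompt if it is in the roster, else 0.
    const configuredIndex = roster.indexOf(firstPrompt);
    startIndex = configuredIndex === -1 ? 0 : configuredIndex;
  } else {
    startIndex = (previousIndex + 1) % roster.length;
  }

  // Scan the whole roster once, wrapping around the end.
  for (let offset = 0; offset < roster.length; offset += 1) {
    const candidate = roster[(startIndex + offset) % roster.length];
    if (!excluded.has(candidate)) return candidate;
  }
  return null;
}
```

With a shortened, illustrative roster of `audit-agent`, `ci-health-agent`, `const-refactor-agent`, `content-audit-agent`, and `docs-agent`, the latest file from the daily example, and the same exclusion set, `selectNextAgent` returns `content-audit-agent`, which matches the worked example.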
  7. If every roster agent is excluded, write a _failed.md log with reason All roster tasks currently claimed by other agents, and stop.

  8. Claim selected agent:

    AGENT_PLATFORM=<platform> \
    npm run lock:lock -- --agent <agent-name> --cadence <cadence>
    
    • Exit 0: lock acquired, continue.
    • Exit 3: race lost/already locked, return to step 2.
    • Exit 2: lock backend error.
      • If scheduler.strict_lock is false, defer the run while both budget constraints are still satisfied:
        • degraded_lock_retry_window (ms) since first backend failure has not elapsed.
        • max_deferrals has not been exceeded.
        • Record deferral metadata in scheduler run state (attempt_count, first_failure_timestamp, backend_category, and preserved idempotency key).
      • Otherwise write _failed.md with reason Lock backend error, and include failure metadata fields:
        • backend_category (classified backend failure category)
        • lock_command (raw lock command for retry)
        • lock_stderr_excerpt (redacted stderr snippet)
        • lock_stdout_excerpt (redacted stdout snippet)
      • Include failure_class: backend_unavailable and failure_category: lock_backend_error for both deferred and failed backend-unavailable lock outcomes.
    • Include recommended auto-remediation text in detail: retry window, npm run lock:health -- --cadence <cadence>, and incident runbook link.
    • Keep generic reason text for compatibility, but append actionable retry guidance in detail using the command from lock_command.
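
The exit-code branches in step 8 might be sketched as below. The parameter names are hypothetical, "max_deferrals has not been exceeded" is interpreted as "fewer prior deferrals than the maximum", and the handling of exit codes other than 0, 2, and 3 is an assumption because this document does not define them.

```javascript
// Hypothetical sketch of the step 8 decision. Times are in milliseconds.
function classifyLockExit({
  exitCode, strictLock, nowMs, firstFailureMs, retryWindowMs, priorDeferrals, maxDeferrals,
}) {
  if (exitCode === 0) return { action: "continue" };
  if (exitCode === 3) return { action: "return_to_step_2" };
  if (exitCode !== 2) {
    // Assumption: other exit codes are not defined in this document; fail closed.
    return { action: "failed", reason: `Unexpected lock exit code ${exitCode}` };
  }

  // Exit 2: lock backend error. Both budget constraints must still hold to defer.
  const withinWindow = nowMs - firstFailureMs < retryWindowMs;
  const deferralsRemain = priorDeferrals < maxDeferrals;
  const backendFields = {
    failure_class: "backend_unavailable",
    failure_category: "lock_backend_error",
  };
  if (!strictLock && withinWindow && deferralsRemain) {
    return { action: "deferred", ...backendFields };
  }
  return { action: "failed", reason: "Lock backend error", ...backendFields };
}
```

Both the deferred and failed outcomes carry the same `failure_class` and `failure_category`, so downstream triage can group them regardless of which branch was taken.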
  9. Execute <prompt_dir>/<prompt-file> end-to-end via configured handoff command.

    • Scheduler automation runs scheduler.handoffCommandByCadence.<cadence> with environment variables for cadence/agent/prompt path.
    • If no handoff command is configured for the cadence, write _failed.md and stop.
    • Prompt file read/parse failures should emit failure_category: prompt_parse_error.
    • Prompt schema/contract failures should emit failure_category: prompt_schema_error.
    • Command/handoff/validation runtime failures should emit failure_category: execution_error.
  10. Confirm memory contract completion:

    • Memory retrieval evidence must exist for this run (output marker and/or artifact file).
    • Memory storage evidence must exist for this run (output marker and/or artifact file).
    • Enforced daily/weekly commands in torch-config.json run node --input-type=module snippets that call src/services/memory/index.js APIs directly:
      • Retrieval path: ingestEvents(...) seed + getRelevantMemories(...) retrieval.
      • Storage path: ingestEvents(...) (ingestor + summarizer path).
    • Input contract for retrieval evidence JSON:
      • Required keys: cadence, operation: "retrieve", servicePath, inputs, outputs, status: "ok".
      • inputs must include agentId, query, and seeded events count.
      • outputs must include ingestedCount and retrievedCount.
    • Input contract for storage evidence JSON:
      • Required keys: cadence, operation: "store", servicePath, inputs, outputs, status: "ok".
      • inputs must include agentId and input events count.
      • outputs must include storedCount and generated summaries.
    • Failure semantics for required mode:
      • If retrieval/store command exits non-zero, scheduler writes _failed.md and stops.
      • If command succeeds but markers/artifacts are missing, scheduler treats this as missing memory evidence.
    • Prompt authors MUST keep any command changes aligned with configured markers/artifacts so scheduler evidence checks remain satisfiable.
    • If scheduler.memoryPolicyByCadence.<cadence>.mode = required, missing evidence is a hard failure.
    • If mode is optional, log warning context and continue.
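
A minimal sketch of the evidence checks in step 10 follows. It validates only the keys this document names explicitly; the key names for the seeded event count, input event count, and generated summaries are not spelled out here, so the sketch leaves them unchecked. Function names are hypothetical.

```javascript
// Hypothetical validators for the evidence JSON contracts described above.
const REQUIRED_TOP_LEVEL = ["cadence", "operation", "servicePath", "inputs", "outputs", "status"];

// Only fields whose key names appear in this document are listed.
const NAMED_FIELDS = {
  retrieve: { inputs: ["agentId", "query"], outputs: ["ingestedCount", "retrievedCount"] },
  store: { inputs: ["agentId"], outputs: ["storedCount"] },
};

// Returns a list of human-readable problems; an empty list means the checks passed.
function evidenceProblems(evidence, operation) {
  const problems = [];
  for (const key of REQUIRED_TOP_LEVEL) {
    if (!(key in evidence)) problems.push(`missing ${key}`);
  }
  if (evidence.operation !== operation) problems.push(`operation must be "${operation}"`);
  if (evidence.status !== "ok") problems.push('status must be "ok"');
  for (const section of ["inputs", "outputs"]) {
    for (const field of NAMED_FIELDS[operation][section]) {
      if (evidence[section]?.[field] === undefined) problems.push(`missing ${section}.${field}`);
    }
  }
  return problems;
}

// Applies the mode rule: missing evidence fails in required mode, warns otherwise.
function memoryGate(mode, evidenceFound) {
  if (evidenceFound) return "pass";
  return mode === "required" ? "fail" : "warn";
}
```

Returning a list of problems rather than a boolean lets the scheduler include the specific gaps in the _failed.md reason or in the optional-mode warning.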
  11. Verify required run artifacts for the current run window.

    • Scheduler runs node scripts/agent/verify-run-artifacts.mjs --since <run-start-iso> --check-failure-notes.
    • If artifact verification exits non-zero: write _failed.md and stop.
  12. Run repository checks (for example: npm run lint).

    • If any validation command exits non-zero: fail the run immediately, write _failed.md with the failing command and reason, and stop.
    • When step 12 fails, step 13 MUST NOT run (lock:complete is forbidden until validation passes).
  13. Publish completion before writing final success log:

    AGENT_PLATFORM=<platform> \
    npm run lock:complete -- --agent <agent-name> --cadence <cadence>
    

    (Equivalent invocation is allowed: torch-lock complete --agent <agent-name> --cadence <cadence>.)

    • Exit 0: completion published successfully; continue to step 14.
    • Exit non-zero: fail the run, write _failed.md with a clear reason that completion publish failed and retry guidance (for example: Retry npm run lock:complete -- --agent <agent-name> --cadence <cadence> after verifying relay connectivity), then stop.
  14. Create final task log only after step 13 succeeds (scheduler-owned):

    • _completed.md MUST be created only after completion publish succeeds.
    • _failed.md is required when step 11, step 12, or step 13 fails, and should include the failure reason and next retry action.
    • Include platform in frontmatter using AGENT_PLATFORM (or the scheduler --platform value) for both _completed.md and _failed.md.
  15. Commit/push behavior is delegated outside this scheduler script.

    • scripts/agent/run-scheduler-cycle.mjs does not run git commit or git push.
    • If commit/push is required for your workflow, perform it in the configured handoff agent command or a separate orchestration step.
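
The ordering constraint across artifact verification, repository checks, completion publish, and the final task log can be sketched as a short-circuiting sequence. The gate names and the convention that each gate is a function returning an exit code are assumptions for illustration; the real scheduler runs external commands.

```javascript
// Hypothetical sketch: each gate is a function returning a process-style exit code.
function finishRun(gates) {
  // Fixed order: verify artifacts, run repository checks, then publish completion.
  const order = ["verifyArtifacts", "repositoryChecks", "lockComplete"];
  const executed = [];
  for (const name of order) {
    executed.push(name);
    if (gates[name]() !== 0) {
      // Stop at the first failure; later gates (including lockComplete) never run.
      return { executed, log: "_failed.md", failedStep: name };
    }
  }
  // _completed.md is only reachable after lockComplete returned 0.
  return { executed, log: "_completed.md", failedStep: null };
}
```

Because the loop returns on the first non-zero exit code, a lint failure leaves `lockComplete` out of `executed`, which mirrors the rule that completion publish is forbidden until validation passes.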

Worked post-task example (MUST order):

  1. AGENT_PLATFORM=codex npm run lock:lock -- --agent content-audit-agent --cadence daily
  2. Execute torch/prompts/daily/content-audit-agent.md
  3. node scripts/agent/verify-run-artifacts.mjs --since <run-start-iso> --check-failure-notes
  4. npm run lint passes (repository checks)
  5. AGENT_PLATFORM=codex npm run lock:complete -- --agent content-audit-agent --cadence daily (complete, permanent)
  6. Write task-logs/daily/2026-02-14T10-00-00Z__content-audit-agent__completed.md

Worked validation-failure example (MUST behavior):

  1. AGENT_PLATFORM=codex npm run lock:lock -- --agent content-audit-agent --cadence daily
  2. Execute torch/prompts/daily/content-audit-agent.md
  3. node scripts/agent/verify-run-artifacts.mjs --since <run-start-iso> --check-failure-notes passes
  4. npm run lint exits non-zero (or npm test exits non-zero)
  5. Write task-logs/daily/2026-02-14T10-00-00Z__content-audit-agent__failed.md with the failing command and reason
  6. Stop the run without calling npm run lock:complete -- --agent content-audit-agent --cadence daily