3) Codex instructions (complete per module, in one shot)
ONE MODULE = ONE PROMPT = ONE VERIFY REPORT
You will copy any one section below directly to Codex. Requirement: each Module must deliver runnable code + acceptance script + quantified report (JSON + MD).
Our acceptance looks only at ./reports/phase_<X>_acceptance.json together with the verify command output.
Unified hard requirements (all modules)
- Must generate: ./reports/phase_<X>_acceptance.json
- Must generate: ./reports/phase_<X>_acceptance.md
- Required: make verify-phase-<x> or python scripts/validation/verify_phase_<x>.py
- acceptance.json must contain: pass (bool), metrics, artifacts (path + sha256), how_to_run, fail_reasons[]
- All key events are written to ./audit/events.jsonl (including correlation_id)
- correlation_id propagation rules: Telegram(message_id) → SSOT(task.correlation_id) → tool_call.correlation_id → audit.correlation_id (do not regenerate the id at any step; regeneration breaks the link)
Minimum field for quantified output (JSON)
{
  "phase": "A|B|C|D|E|F",
  "pass": true,
  "started_at": "ISO8601",
  "finished_at": "ISO8601",
  "how_to_run": "make verify-phase-a",
  "metrics": {"example": 123},
  "artifacts": [{"path": "...", "sha256": "...", "note": "optional"}],
  "fail_reasons": []
}
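A minimal check for these required fields can be sketched as follows (`check_acceptance` is a hypothetical helper, not part of any module's deliverables):

```python
REQUIRED = ["phase", "pass", "started_at", "finished_at",
            "how_to_run", "metrics", "artifacts", "fail_reasons"]

def check_acceptance(report: dict) -> list[str]:
    """Return a list of problems; an empty list means the report has the minimum fields."""
    problems = [f"missing field: {k}" for k in REQUIRED if k not in report]
    if not isinstance(report.get("pass"), bool):
        problems.append("pass must be a bool")
    for art in report.get("artifacts", []):
        if "path" not in art or "sha256" not in art:
            problems.append(f"artifact missing path/sha256: {art}")
    return problems
```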
You are the Codex (executor). Please implement Module 1: SSOT + API minimum closed loop, and deliver a quantifiable acceptance report.
System boundaries (must be adhered to):
- SSOT is the only truth (simulated tonight with local ./ssot/).
- Backend only does infrastructure: reading and writing SSOT, broadcasting events, and writing audits; it does not write "intelligent judgment".
Implementation goals:
1) FastAPI:
- GET /api/tasks
- GET /api/tasks/{task_id}
- POST /api/tasks/{task_id}/approve (admin token)
- POST /api/tasks/{task_id}/reject (admin token)
2) SSOT local implementation: ./ssot/{task_id}.json
- Writes must be atomic (tmp + rename).
3) Task status view: task_engine generates status (pending/needs_review/approved/rejected).
4) Audit: ./audit/events.jsonl (records reading and writing, approve/reject, errors; each item has correlation_id).
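The atomic SSOT write and the audit append above can be sketched like this (a minimal sketch; directory layout and field names beyond task_id/correlation_id are assumptions):

```python
import json
import os
import tempfile
import time

def write_ssot_atomic(ssot_dir: str, task: dict) -> str:
    """Write the task to <ssot_dir>/{task_id}.json via tmp file + rename (atomic on POSIX)."""
    os.makedirs(ssot_dir, exist_ok=True)
    final = os.path.join(ssot_dir, f"{task['task_id']}.json")
    fd, tmp = tempfile.mkstemp(dir=ssot_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(task, f, ensure_ascii=False, sort_keys=True)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, final)  # atomic rename; readers never see a partial file
    return final

def audit(audit_path: str, event: str, correlation_id: str, **fields) -> None:
    """Append one JSON line per event to audit/events.jsonl, always carrying correlation_id."""
    os.makedirs(os.path.dirname(audit_path), exist_ok=True)
    record = {"ts": time.time(), "event": event,
              "correlation_id": correlation_id, **fields}
    with open(audit_path, "a") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```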
Acceptance script (required):
- Provide make verify-phase-a (or scripts/validation/verify_phase_a.py), complete:
a) Start the service (or detect that it has been started)
b) Create a sample task (if it does not exist)
c) Request the task detail 10 consecutive times; after canonicalization, the sha256 must be stable (hash_stability_pass=true)
d) After calling approve or reject, the verdict must be visible in the very next GET
e) Measure API latency (at least p50/p95)
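Step (c) above can be sketched as follows — canonicalization here means serializing with sorted keys after stripping volatile fields (the field name `served_at` is a placeholder for whatever per-request fields the response actually carries):

```python
import hashlib
import json

VOLATILE = {"served_at"}  # placeholder: strip any per-request fields before hashing

def canonical_sha256(payload: dict) -> str:
    """Hash a canonical form: volatile fields removed, keys sorted, compact separators."""
    stable = {k: v for k, v in payload.items() if k not in VOLATILE}
    blob = json.dumps(stable, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()

def hash_stability(responses: list[dict]) -> bool:
    """True when every detail response canonicalizes to one and the same sha256."""
    return len({canonical_sha256(r) for r in responses}) == 1
```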
Quantitative indicators (written into ./reports/phase_A_acceptance.json):
- pass (overall)
- metrics.api_p50_ms, metrics.api_p95_ms
- metrics.hash_stability_pass (bool)
- metrics.tests_passed, metrics.tests_failed
- artifacts: contains at least one sample task file and a response hash description (optional)
Deliverables:
- Runnable code + README
- ./reports/phase_A_acceptance.json & ./reports/phase_A_acceptance.md
- verify command
NOTE: Write any assumptions in the README; do not introduce external publishing capabilities.
You are the Codex (executor). Please implement Module 2: SSE + Minimal Dashboard and deliver a quantifiable acceptance report.
Implementation goals:
1) Add SSE: GET /api/events
- Maintain connection pool.
- Broadcast event when SSOT changes (approve/reject or tool results written back).
2) Minimal dashboard (static HTML is enough):
- Display tasks list and single task details.
- Subscribe to SSE for auto-refresh (printing events to the console is an acceptable minimum).
3) Audit: events.jsonl records sse_client_connect/disconnect and broadcast_count.
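One way to sketch the connection pool and broadcast (plain asyncio, framework-agnostic; each subscriber gets its own queue, and the SSE endpoint would drain that queue into the response stream — the class name `EventHub` is an assumption):

```python
import asyncio
import json

class EventHub:
    """Per-client queues: subscribe on connect, broadcast on SSOT change."""
    def __init__(self) -> None:
        self.clients: set[asyncio.Queue] = set()
        self.broadcast_count = 0

    def subscribe(self) -> asyncio.Queue:
        q: asyncio.Queue = asyncio.Queue()
        self.clients.add(q)
        return q

    def unsubscribe(self, q: asyncio.Queue) -> None:
        self.clients.discard(q)

    def broadcast(self, event: dict) -> int:
        """Push one serialized event to every connected client; returns clients reached."""
        data = json.dumps(event)
        for q in self.clients:
            q.put_nowait(data)
        self.broadcast_count += 1
        return len(self.clients)
```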
Acceptance script (required):
- make verify-phase-b (or scripts/validation/verify_phase_b.py), complete:
a) Start the service
b) Start 3 SSE clients (scripts) and keep connected
c) Trigger 20 status changes (e.g. approve/reject or write back fields)
d) Count the arrival delay (p50/p95) of each event and count the number of disconnections
Quantitative indicators (written into ./reports/phase_B_acceptance.json):
- metrics.sse_connected_clients (>= 3)
- metrics.sse_events_sent, metrics.sse_events_received
- metrics.sse_latency_p50_ms, metrics.sse_latency_p95_ms
- metrics.sse_disconnect_count
Deliverables:
- Code + README
- ./reports/phase_B_acceptance.json & .md
- verify command
You are the Codex (executor). Please implement Module 3: Webhook input + Telegram Ingest (as the acceptance input), and deliver a quantitative acceptance report.
Implementation goals:
1) POST /webhooks/github
- Verify HMAC-SHA256 (using environment variable WEBHOOK_SECRET).
- Support event_id deduplication (keep the latest N event_ids in memory or persist them to a file).
- Write audit after passing, and optionally update a task field (such as ci_status).
2) Idempotent: replaying the same event_id should not change the final state.
3) Audit: record signature_valid, event_id, dedup_hit, and write results.
4) Telegram Ingest (polling is recommended at first, to avoid taking a dependency on public-network reachability overnight):
- scripts/telegram_poll.py: Poll Telegram getUpdates (BOT_TOKEN) and call the backend /webhooks/telegram for new messages.
- Add POST /webhooks/telegram to the backend: structure the message into an SSOT task (no strategic reasoning).
5) message → SSOT task mapping (required field):
- task_id (a stable id; the first 12 hex characters of sha256(chat_id:message_id) will do)
- source="telegram"
- chat_id, message_id, request_text, created_at
- correlation_id (recommended to equal task_id so the link can always be reconstructed)
- status="pending"
6) Idempotent deduplication (required): at most one task may be created for the same (chat_id, message_id); repeated requests must not create new tasks, but must write the audit event telegram_dedup_hit.
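The signature check, the stable task_id, and the dedup gate can be sketched as follows (the `sha256=<hex>` header format follows GitHub's `X-Hub-Signature-256` convention; the in-memory `_seen` set stands in for whatever persistence the implementation chooses):

```python
import hashlib
import hmac

def verify_github_signature(secret: str, body: bytes, signature_header: str) -> bool:
    """Compare the payload's HMAC-SHA256 against the 'sha256=<hex>' header, constant-time."""
    expected = "sha256=" + hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_header or "")

def telegram_task_id(chat_id: int, message_id: int) -> str:
    """Stable id: first 12 hex characters of sha256('chat_id:message_id')."""
    return hashlib.sha256(f"{chat_id}:{message_id}".encode()).hexdigest()[:12]

_seen: set[str] = set()  # placeholder store; a real backend would persist this

def dedup_hit(task_id: str) -> bool:
    """True when this (chat_id, message_id) was already ingested; caller writes telegram_dedup_hit."""
    if task_id in _seen:
        return True
    _seen.add(task_id)
    return False
```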
Acceptance script (required):
- make verify-phase-c (or scripts/validation/verify_phase_c.py), complete:
a) Send 50 bad-signature requests → 100% must be rejected
b) Send 1 correctly signed event and replay it 10 times → the final state is unchanged (idempotency_pass=true)
c) Flood: 50 rps of bad-signature requests for 15 seconds, while measuring p95 of GET /api/tasks under load
d) Telegram ingest: replay the same message_id 10 times → exactly 1 task is created (telegram_dedup_pass=true)
e) Telegram ingest latency: p95 from the ingest call to the SSOT file appearing on disk (telegram_ingest_p95_ms)
Quantitative indicators (written into ./reports/phase_C_acceptance.json):
- metrics.signature_reject_rate (should be 1.0)
- metrics.idempotency_pass (bool)
- metrics.flood_api_p95_ms
- metrics.audit_events_written (>= number of key events)
- metrics.telegram_dedup_pass (bool)
- metrics.telegram_ingest_p95_ms
- metrics.telegram_tasks_created
Deliverables:
- Code + README
- ./reports/phase_C_acceptance.json & .md
- verify command
You are the Codex (executor). Please implement Module 4: Protocolized Executor/Reviewer (use fake scripts first), and deliver a quantitative acceptance report.
Implementation goals:
1) scripts/validation/fake_executor.py
- Read SSOT task JSON
- Write an ACK echo (repeat acceptance_criteria + declared_scope back item by item)
- Write task_result (contains work_log + diff_snapshot, or at least work_log)
2) scripts/validation/fake_reviewer.py
- Read task_result
- Write verdict (approve/reject)
- issues must all contain criterion_ref
3) Fidelity check (script or function): on re-dispatch, the issues must be byte-identical to verdict.issues.
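The two rate metrics these fakes feed can be computed roughly like this (the shapes of `criteria`, `echo`, and `issues` are assumptions about the SSOT task layout):

```python
def echo_match_rate(criteria: list[str], echo: list[str]) -> float:
    """Fraction of acceptance_criteria the executor's ACK echoed back verbatim, in order."""
    if not criteria:
        return 1.0
    matched = sum(1 for c, e in zip(criteria, echo) if c == e)
    return matched / len(criteria)

def criterion_ref_rate(issues: list[dict]) -> float:
    """Fraction of reviewer issues that carry a criterion_ref."""
    if not issues:
        return 1.0
    return sum(1 for i in issues if i.get("criterion_ref")) / len(issues)
```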
Acceptance script (required):
- make verify-phase-d (or scripts/validation/verify_phase_d.py), complete:
a) Generate a sample task with acceptance_criteria
b) Run scripts/validation/fake_executor.py → check ack.echo_match_rate == 1.0
c) Run scripts/validation/fake_reviewer.py → check issues.with_criterion_ref_rate == 1.0
d) Traceability: every issue can be traced to a corresponding record in work_log (traceability_pass=true)
e) Fidelity: bytes are identical (fidelity_pass=true)
Quantitative indicators (written into ./reports/phase_D_acceptance.json):
- metrics.ack_echo_match_rate
- metrics.issues_with_criterion_ref_rate
- metrics.traceability_pass
- metrics.fidelity_pass
Deliverables:
- Script + README
- ./reports/phase_D_acceptance.json & .md
- verify command
You are the Codex (executor). Please implement Module 5: stability and stress testing scripts, and deliver a quantitative acceptance report.
Implementation goals:
1) SSE soak: start N=50 clients subscribed to /api/events, run for T=10 minutes (can be shortened tonight), and measure disconnect count plus max/mean latency.
2) Concurrent load test: 100 concurrent GET /api/tasks + 10 concurrent writes (approve/reject) + 50 SSE connections.
3) Failure path: simulate at least one error and write it into audit (e.g., tool execution failure or a webhook invalid-signature flood).
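A small helper for the p50/p95/p99 numbers these reports keep asking for (nearest-rank percentile, a deliberate simplification; libraries with interpolating variants exist):

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile over latency samples in ms (p in [0, 100])."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))  # nearest-rank, 1-based
    return ordered[rank - 1]
```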
Acceptance script (required):
- make verify-phase-e (or scripts/validation/verify_phase_e.py), outputs:
a) api_p50/p95/p99_ms
b) sse_disconnect_count, sse_latency_p95_ms
c) number of observed error events (and written to audit)
Quantified metrics (write to ./reports/phase_E_acceptance.json):
- metrics.api_p95_ms
- metrics.sse_soak_disconnect_count
- metrics.sse_latency_p95_ms
- metrics.error_events_written
Deliverables:
- Load-test scripts + README
- ./reports/phase_E_acceptance.json & .md
- verify command
Note
The Telegram acceptance entry you proposed can be an additional ingest for Module 3 (telegram_ingest), or the next step after Tools-1.
But the quantified reporting mechanism (reports + verify) for each module should be fixed now.