3) Codex instructions (complete per module, in one shot)
ONE MODULE = ONE PROMPT = ONE VERIFY REPORT
You will copy any one section below directly to Codex. Requirement: each Module must deliver runnable code + acceptance script + quantified report (JSON + MD).
Our acceptance looks only at ./reports/phase_<X>_acceptance.json together with the verify command output.
Unified hard requirements (all modules)
- Must generate: ./reports/phase_<X>_acceptance.json
- Must generate: ./reports/phase_<X>_acceptance.md
- Required: make verify-phase-<x> or python scripts/validation/verify_phase_<x>.py
- acceptance.json must contain: pass (bool), metrics, artifacts (path + sha256), how_to_run, fail_reasons[]
- All key events are written to ./audit/events.jsonl (including correlation_id)
- correlation_id propagation rules: Telegram(message_id) → SSOT(task.correlation_id) → tool_call.correlation_id → audit.correlation_id (do not regenerate the id at any step; regeneration breaks the link)
Minimum field for quantified output (JSON)
{
  "phase": "A|B|C|D|E|F",
  "pass": true,
  "started_at": "ISO8601",
  "finished_at": "ISO8601",
  "how_to_run": "make verify-phase-a",
  "metrics": {"example": 123},
  "artifacts": [{"path": "...", "sha256": "...", "note": "optional"}],
  "fail_reasons": []
}
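A minimal check for these required fields can be sketched as follows (`check_acceptance` is a hypothetical helper, not part of any module's deliverables):

```python
REQUIRED = ["phase", "pass", "started_at", "finished_at",
            "how_to_run", "metrics", "artifacts", "fail_reasons"]

def check_acceptance(report: dict) -> list[str]:
    """Return a list of problems; an empty list means the report has the minimum fields."""
    problems = [f"missing field: {k}" for k in REQUIRED if k not in report]
    if not isinstance(report.get("pass"), bool):
        problems.append("pass must be a bool")
    for art in report.get("artifacts", []):
        if "path" not in art or "sha256" not in art:
            problems.append(f"artifact missing path/sha256: {art}")
    return problems
```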
You are the Codex (executor). Please implement Module 1: SSOT + API minimum closed loop, and deliver a quantifiable acceptance report.
System boundaries (must be adhered to):
- SSOT is the only truth (simulated tonight with local ./ssot/).
- Backend only does infrastructure: reading and writing SSOT, broadcasting events, and writing audits; it does not write "intelligent judgment".
Implementation goals:
1) FastAPI:
- GET /api/tasks
- GET /api/tasks/{task_id}
- POST /api/tasks/{task_id}/approve (admin token)
- POST /api/tasks/{task_id}/reject (admin token)
2) SSOT local implementation: ./ssot/{task_id}.json
- Writes must be atomic (tmp + rename).
3) Task status view: task_engine generates status (pending/needs_review/approved/rejected).
4) Audit: ./audit/events.jsonl (records reading and writing, approve/reject, errors; each item has correlation_id).
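The atomic SSOT write and the audit append above can be sketched like this (a minimal sketch; directory layout and field names beyond task_id/correlation_id are assumptions):

```python
import json
import os
import tempfile
import time

def write_ssot_atomic(ssot_dir: str, task: dict) -> str:
    """Write the task to <ssot_dir>/{task_id}.json via tmp file + rename (atomic on POSIX)."""
    os.makedirs(ssot_dir, exist_ok=True)
    final = os.path.join(ssot_dir, f"{task['task_id']}.json")
    fd, tmp = tempfile.mkstemp(dir=ssot_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(task, f, ensure_ascii=False, sort_keys=True)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, final)  # atomic rename; readers never see a partial file
    return final

def audit(audit_path: str, event: str, correlation_id: str, **fields) -> None:
    """Append one JSON line per event to audit/events.jsonl, always carrying correlation_id."""
    os.makedirs(os.path.dirname(audit_path), exist_ok=True)
    record = {"ts": time.time(), "event": event,
              "correlation_id": correlation_id, **fields}
    with open(audit_path, "a") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```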
Acceptance script (required):
- Provide make verify-phase-a (or scripts/validation/verify_phase_a.py), complete:
a) Start the service (or detect that it has been started)
b) Create a sample task (if it does not exist)
c) Request the task detail 10 consecutive times; after canonicalization, the sha256 must be stable (hash_stability_pass=true)
d) After calling approve or reject, the verdict must be visible in the very next GET
e) Measure API latency (at least p50/p95)
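Step (c) above can be sketched as follows — canonicalization here means serializing with sorted keys after stripping volatile fields (the field name `served_at` is a placeholder for whatever per-request fields the response actually carries):

```python
import hashlib
import json

VOLATILE = {"served_at"}  # placeholder: strip any per-request fields before hashing

def canonical_sha256(payload: dict) -> str:
    """Hash a canonical form: volatile fields removed, keys sorted, compact separators."""
    stable = {k: v for k, v in payload.items() if k not in VOLATILE}
    blob = json.dumps(stable, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()

def hash_stability(responses: list[dict]) -> bool:
    """True when every detail response canonicalizes to one and the same sha256."""
    return len({canonical_sha256(r) for r in responses}) == 1
```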
Quantitative indicators (written into ./reports/phase_A_acceptance.json):
- pass (overall)
- metrics.api_p50_ms, metrics.api_p95_ms
- metrics.hash_stability_pass (bool)
- metrics.tests_passed, metrics.tests_failed
- artifacts: contains at least one sample task file and a response hash description (optional)
Deliverables:
- Runnable code + README
- ./reports/phase_A_acceptance.json & ./reports/phase_A_acceptance.md
- verify command
NOTE: Write any assumptions in the README; do not introduce external publishing capabilities.
You are the Codex (executor). Please implement Module 2: SSE + Minimal Dashboard and deliver a quantifiable acceptance report.
Implementation goals:
1) Add SSE: GET /api/events
- Maintain connection pool.
- Broadcast event when SSOT changes (approve/reject or tool results written back).
2) Minimal dashboard (static HTML is enough):
- Display tasks list and single task details.
- Subscribe to SSE for auto-refresh (printing events to the console is an acceptable minimum).
3) Audit: events.jsonl records sse_client_connect/disconnect and broadcast_count.
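One way to sketch the connection pool and broadcast (plain asyncio, framework-agnostic; each subscriber gets its own queue, and the SSE endpoint would drain that queue into the response stream — the class name `EventHub` is an assumption):

```python
import asyncio
import json

class EventHub:
    """Per-client queues: subscribe on connect, broadcast on SSOT change."""
    def __init__(self) -> None:
        self.clients: set[asyncio.Queue] = set()
        self.broadcast_count = 0

    def subscribe(self) -> asyncio.Queue:
        q: asyncio.Queue = asyncio.Queue()
        self.clients.add(q)
        return q

    def unsubscribe(self, q: asyncio.Queue) -> None:
        self.clients.discard(q)

    def broadcast(self, event: dict) -> int:
        """Push one serialized event to every connected client; returns clients reached."""
        data = json.dumps(event)
        for q in self.clients:
            q.put_nowait(data)
        self.broadcast_count += 1
        return len(self.clients)
```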
Acceptance script (required):
- make verify-phase-b (or scripts/validation/verify_phase_b.py), complete:
a) Start the service
b) Start 3 SSE clients (scripts) and keep connected
c) Trigger 20 status changes (e.g. approve/reject or write back fields)
d) Count the arrival delay (p50/p95) of each event and count the number of disconnections
Quantitative indicators (written into ./reports/phase_B_acceptance.json):
- metrics.sse_connected_clients (>= 3)
- metrics.sse_events_sent, metrics.sse_events_received
- metrics.sse_latency_p50_ms, metrics.sse_latency_p95_ms
- metrics.sse_disconnect_count
Deliverables:
- Code + README
- ./reports/phase_B_acceptance.json & .md
- verify command
You are the Codex (executor). Please implement Module 3: Webhook input + Telegram Ingest (as the acceptance input), and deliver a quantitative acceptance report.
Implementation goals:
1) POST /webhooks/github
- Verify HMAC-SHA256 (using environment variable WEBHOOK_SECRET).
- Support event_id deduplication (keep the latest N event_ids in memory or persist them to a file).
- Write audit after passing, and optionally update a task field (such as ci_status).
2) Idempotent: replaying the same event_id should not change the final state.
3) Audit: record signature_valid, event_id, dedup_hit, and write results.
4) Telegram Ingest (polling is recommended at first, to avoid taking a dependency on public-network reachability overnight):
- scripts/telegram_poll.py: Poll Telegram getUpdates (BOT_TOKEN) and call the backend /webhooks/telegram for new messages.
- Add POST /webhooks/telegram to the backend: structure the message into an SSOT task (no strategic reasoning).
5) message → SSOT task mapping (required field):
- task_id (a stable id; the first 12 hex characters of sha256(chat_id:message_id) will do)
- source="telegram"
- chat_id, message_id, request_text, created_at
- correlation_id (recommended to equal task_id so the link can always be reconstructed)
- status="pending"
6) Idempotent deduplication (required): at most one task may be created for the same (chat_id, message_id); repeated requests must not create new tasks, but must write the audit event telegram_dedup_hit.
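The signature check, the stable task_id, and the dedup gate can be sketched as follows (the `sha256=<hex>` header format follows GitHub's `X-Hub-Signature-256` convention; the in-memory `_seen` set stands in for whatever persistence the implementation chooses):

```python
import hashlib
import hmac

def verify_github_signature(secret: str, body: bytes, signature_header: str) -> bool:
    """Compare the payload's HMAC-SHA256 against the 'sha256=<hex>' header, constant-time."""
    expected = "sha256=" + hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_header or "")

def telegram_task_id(chat_id: int, message_id: int) -> str:
    """Stable id: first 12 hex characters of sha256('chat_id:message_id')."""
    return hashlib.sha256(f"{chat_id}:{message_id}".encode()).hexdigest()[:12]

_seen: set[str] = set()  # placeholder store; a real backend would persist this

def dedup_hit(task_id: str) -> bool:
    """True when this (chat_id, message_id) was already ingested; caller writes telegram_dedup_hit."""
    if task_id in _seen:
        return True
    _seen.add(task_id)
    return False
```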
Acceptance script (required):
- make verify-phase-c (or scripts/validation/verify_phase_c.py), complete:
a) Send 50 bad-signature requests → 100% must be rejected
b) Send 1 correctly signed event and replay it 10 times → the final state is unchanged (idempotency_pass=true)
c) Flood: 50 rps of bad-signature requests for 15 seconds, while measuring p95 of GET /api/tasks under load
d) Telegram ingest: replay the same message_id 10 times → exactly 1 task is created (telegram_dedup_pass=true)
e) Telegram ingest latency: p95 from the ingest call to the SSOT file appearing on disk (telegram_ingest_p95_ms)
Quantitative indicators (written into ./reports/phase_C_acceptance.json):
- metrics.signature_reject_rate (should be 1.0)
- metrics.idempotency_pass (bool)
- metrics.flood_api_p95_ms
- metrics.audit_events_written (>= number of key events)
- metrics.telegram_dedup_pass (bool)
- metrics.telegram_ingest_p95_ms
- metrics.telegram_tasks_created
Deliverables:
- Code + README
- ./reports/phase_C_acceptance.json & .md
- verify command
You are the Codex (executor). Please implement Module 4: Protocolized Executor/Reviewer (use fake scripts first), and deliver a quantitative acceptance report.
Implementation goals:
1) scripts/validation/fake_executor.py
- Read SSOT task JSON
- Write an ACK echo (repeat acceptance_criteria + declared_scope back item by item)
- Write task_result (contains work_log + diff_snapshot, or at least work_log)
2) scripts/validation/fake_reviewer.py
- Read task_result
- Write verdict (approve/reject)
- issues must all contain criterion_ref
3) Fidelity check (script or function): on re-dispatch, the issues must be byte-identical to verdict.issues.
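The two rate metrics these fakes feed can be computed roughly like this (the shapes of `criteria`, `echo`, and `issues` are assumptions about the SSOT task layout):

```python
def echo_match_rate(criteria: list[str], echo: list[str]) -> float:
    """Fraction of acceptance_criteria the executor's ACK echoed back verbatim, in order."""
    if not criteria:
        return 1.0
    matched = sum(1 for c, e in zip(criteria, echo) if c == e)
    return matched / len(criteria)

def criterion_ref_rate(issues: list[dict]) -> float:
    """Fraction of reviewer issues that carry a criterion_ref."""
    if not issues:
        return 1.0
    return sum(1 for i in issues if i.get("criterion_ref")) / len(issues)
```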
Acceptance script (required):
- make verify-phase-d (or scripts/validation/verify_phase_d.py), complete:
a) Generate a sample task with acceptance_criteria
b) Run scripts/validation/fake_executor.py → check ack.echo_match_rate == 1.0
c) Run scripts/validation/fake_reviewer.py → check issues.with_criterion_ref_rate == 1.0
d) Traceability: every issue can be traced to a corresponding record in work_log (traceability_pass=true)
e) Fidelity: bytes are identical (fidelity_pass=true)
Quantitative indicators (written into ./reports/phase_D_acceptance.json):
- metrics.ack_echo_match_rate
- metrics.issues_with_criterion_ref_rate
- metrics.traceability_pass
- metrics.fidelity_pass
Deliverables:
- Script + README
- ./reports/phase_D_acceptance.json & .md
- verify command
You are the Codex (executor). Please implement Module 5: stability and stress testing scripts, and deliver a quantitative acceptance report.
Implementation goals:
1) SSE soak: start N=50 clients subscribed to /api/events, run for T=10 minutes (can be shortened tonight), and measure disconnect count plus max/mean latency.
2) Concurrent load test: 100 concurrent GET /api/tasks + 10 concurrent writes (approve/reject) + 50 SSE connections.
3) Failure path: simulate at least one error and write it into audit (e.g., tool execution failure or a webhook invalid-signature flood).
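A small helper for the p50/p95/p99 numbers these reports keep asking for (nearest-rank percentile, a deliberate simplification; libraries with interpolating variants exist):

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile over latency samples in ms (p in [0, 100])."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))  # nearest-rank, 1-based
    return ordered[rank - 1]
```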
Acceptance script (required):
- make verify-phase-e (or scripts/validation/verify_phase_e.py), outputs:
a) api_p50/p95/p99_ms
b) sse_disconnect_count, sse_latency_p95_ms
c) number of observed error events (and written to audit)
Quantified metrics (write to ./reports/phase_E_acceptance.json):
- metrics.api_p95_ms
- metrics.sse_soak_disconnect_count
- metrics.sse_latency_p95_ms
- metrics.error_events_written
Deliverables:
- Load-test scripts + README
- ./reports/phase_E_acceptance.json & .md
- verify command
Note
The Telegram acceptance entry you proposed can be an additional ingest for Module 3 (telegram_ingest), or the next step after Tools-1.
But the quantified reporting mechanism (reports + verify) for each module should be fixed now.