PR-E6 Dispatcher Canary Monitoring (24h)
1. Purpose
Field issue: the Zeus EMS OCPP backend dispatcher starts returning 504 timeouts once backend uptime reaches several hours. A manual docker compose restart backend clears the symptom; according to the working hypothesis, a corrupt Redis pubsub session and the listen_commands restart loop trigger each other.
Hypothesis: asyncio.CancelledError is swallowed inside the redis-py 5.x async-iterator generator → the outer retry loop falls into the generic except Exception path → endless restart loop + corrupt pubsub session.
Fix (PR-E6): refactor the pubsub.listen() async-iterator pattern into a pubsub.get_message(timeout=1.0, ignore_subscribe_messages=True) polling pattern. CancelledError then propagates correctly, and the outer loop only kicks in for genuine connection problems.
Previous fix (PR-E1): Redis owner-registry TTL refresh. The heartbeat handler extends the TTL on every device heartbeat (zombie cleanup + owner consistency).
Unit tests could not reproduce the issue, so a 24-hour production canary monitoring period is mandatory. If the fix does not help, the alternative hypotheses (Redis pool exhaustion, EMQX broker, gunicorn worker recycle) must be investigated.
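The polling pattern named in the fix can be sketched roughly as follows. This is a minimal illustration, not the actual dispatcher.py code: poll_commands, PubSubLike and FakePubSub are hypothetical names, and the fake stands in for a real Redis pubsub so the sketch runs without a broker. The point it shows is that a loop with no try/except around get_message lets CancelledError reach the caller.

```python
import asyncio
from typing import Awaitable, Callable, Optional, Protocol


class PubSubLike(Protocol):
    """Hypothetical minimal shape of a pubsub object with a get_message coroutine."""

    async def get_message(
        self, ignore_subscribe_messages: bool = False, timeout: Optional[float] = 0.0
    ) -> Optional[dict]: ...


async def poll_commands(
    pubsub: PubSubLike, handle: Callable[[dict], Awaitable[None]]
) -> None:
    """Poll for messages until the surrounding task is cancelled."""
    while True:
        # No try/except here: CancelledError propagates to the caller untouched.
        message = await pubsub.get_message(ignore_subscribe_messages=True, timeout=1.0)
        if message is not None:
            await handle(message)


class FakePubSub:
    """Test stand-in (assumption): returns queued messages, then idles."""

    def __init__(self, messages):
        self._messages = list(messages)

    async def get_message(self, ignore_subscribe_messages=False, timeout=0.0):
        if self._messages:
            return self._messages.pop(0)
        await asyncio.sleep(0.01)  # simulate an empty poll
        return None


async def demo():
    received = []

    async def handle(message: dict) -> None:
        received.append(message["data"])

    task = asyncio.create_task(poll_commands(FakePubSub([{"data": "Reset"}]), handle))
    await asyncio.sleep(0.05)  # let the loop consume the queued message
    task.cancel()
    try:
        await task
        cancelled = False
    except asyncio.CancelledError:
        cancelled = True
    return received, cancelled
```

Running asyncio.run(demo()) delivers the queued message to the handler and then reports that the cancellation surfaced as CancelledError rather than being absorbed by a retry path.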
2. New Metrics (PR-E6 canary)
| Metric | Type | Label | Meaning |
|---|---|---|---|
| ocpp_dispatcher_loop_errors_total | Counter | error_kind (5 values) | Error counter for the listen_commands outer retry loop. Must stay at 0 for the full 24h. |
| ocpp_dispatcher_listening_active | Gauge | none | Number of active listen_commands tasks. Should be close to ocpp_ws_connections_active. |
| ocpp_command_timeouts_total{timeout_kind="pubsub"} | Counter (existing) | action, timeout_kind | Caller-side case where the reply never arrived. After the PR-E6 fix it should increase only when a charger is genuinely offline. |
error_kind whitelist (cardinality control, max 5):
- redis_connection: RedisConnectionError / ConnectionError
- os_error: OSError (socket/dns)
- timeout: asyncio.TimeoutError (unexpected path)
- cancelled: CancelledError leaked into the generic loop (BUG signal, must never happen)
- unknown: generic Exception that matches none of the above
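One way to map an exception onto the five whitelisted labels is sketched below. classify_error is a hypothetical helper, not the function in ocpp_metrics.py, and it uses the builtin ConnectionError as a stand-in for the Redis connection error type (pass the real type via connection_types). The ordering matters because several of these exception classes are subclasses of OSError.

```python
import asyncio

# The five allowed label values from the whitelist above.
ERROR_KINDS = ("redis_connection", "os_error", "timeout", "cancelled", "unknown")


def classify_error(exc: BaseException, connection_types: tuple = (ConnectionError,)) -> str:
    """Return the error_kind label for an exception (hypothetical helper)."""
    if isinstance(exc, asyncio.CancelledError):
        return "cancelled"
    # Checked before OSError: on Python 3.11+ asyncio.TimeoutError is the
    # builtin TimeoutError, which is an OSError subclass.
    if isinstance(exc, asyncio.TimeoutError):
        return "timeout"
    # The builtin ConnectionError is also an OSError subclass, so it must
    # be tested before the generic OSError branch.
    if isinstance(exc, connection_types):
        return "redis_connection"
    if isinstance(exc, OSError):
        return "os_error"
    return "unknown"
```

Because every return value comes from a fixed set, the label cardinality stays bounded at five regardless of which exceptions occur.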
3. 24h Checkpoint Table
After the canary deploy, verify with docker compose ps that the backend is up/healthy, then open the Grafana OCPP Charger Fleet > PR-E6 Dispatcher Canary row at the times below.
| Time | Check | Expected | Action (on deviation) |
|---|---|---|---|
| T+0 (after deploy) | Backend uptime panel | Fresh start (on the order of seconds) | n/a |
| T+0 | 1× test command: RemoteStartTransaction or Reset | Accepted/Rejected within < 5s | Sustained timeout → rollback. Logs: ocpp_dispatcher_pubsub_timeout |
| T+0 | ocpp_dispatcher_loop_errors_total (5m rate) | 0 (for every error_kind) | redis_connection > 0 → check the broker; cancelled/unknown > 0 → BUG, inspect dispatcher.py |
| T+0 | Aktif Listener / WS Baglanti (active listeners / WS connections) panel | The two counts are close (ratio ≥0.9) | Large gap → register/spawn lifecycle is broken |
| T+1h | increase(ocpp_command_timeouts_total{timeout_kind="pubsub"}[1h]) | 0 (unless a charger is genuinely offline) | >0 while the charger is online → fix insufficient |
| T+1h | Command latency p99 panel | < 2s (steady-state) | >5s sustained → Alert: OcppCommandLatencyP99High |
| T+6h | 1× test command + uptime > 6h | Command Accepted, uptime increasing monotonically | Restart occurred → unexpected crash, Alert: OcppBackendUptimeDrop |
| T+12h | 1× test command (old symptom window) | Command Accepted in < 5s | 504 → fix insufficient, move to the alternative hypotheses (section 5) |
| T+12h | increase(ocpp_dispatcher_loop_errors_total[12h]) | 0 | >0 → error_kind breakdown (Grafana panel #11) |
| T+24h | All panels green | Cumulative dispatcher_loop_errors still 0, pubsub timeouts only for genuinely offline chargers | Clear → fix successful, close the canary. Otherwise → go to section 5 |
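The PromQL checks in the table can also be evaluated outside Grafana through the Prometheus HTTP API. The sketch below only builds the instant-query URL and interprets a response body; it performs no network call. build_query_url and nonzero_series are hypothetical helper names, and the base URL used in the usage note is an assumption.

```python
import json
from urllib.parse import urlencode


def build_query_url(base_url: str, promql: str) -> str:
    """Build a Prometheus instant-query URL (/api/v1/query) for a PromQL expression."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def nonzero_series(response_body: str) -> dict:
    """Return {error_kind: value} for every series in an instant-vector result that is not 0."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    offenders = {}
    for series in payload["data"]["result"]:
        _timestamp, raw_value = series["value"]  # value is [unix_ts, "string number"]
        value = float(raw_value)
        if value != 0.0:
            label = series["metric"].get("error_kind", "<no error_kind>")
            offenders[label] = value
    return offenders
```

For example, build_query_url("http://prometheus:9090", "increase(ocpp_dispatcher_loop_errors_total[12h])") gives the URL to fetch, and an empty dict from nonzero_series means the T+12h expectation of 0 holds for every error_kind.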
4. Alert Signals and Actions
Full definitions of the Prometheus alerts: prometheus/rules/ocpp.yml (group: ocpp).
4.1 OcppCommandLatencyP99High (warning, 10m)
Meaning: command p99 latency stays >5s for 10 minutes. This is a signal of degradation after hours of uptime, exactly the manifestation the field hypothesis predicts.
Triage:
- Open the Grafana PR-E6 Dispatcher Canary > Komut Latency p50/p95/p99 panel. When did p99 trip?
- Does the Loop Error Rate panel show errors in the same time window? If so, look at the error_kind breakdown.
- Backend uptime panel: is there a restart loop?
- Logs:
docker compose logs backend --since 30m | grep ocpp_dispatcher
Action:
- Sustained (>30m) and dispatcher_loop_errors > 0 → consider a rollback. Notify Slack #zeus-incidents and revert to main HEAD in coordination with devops-deployment-agent.
- Latency high but loop_errors == 0 → backend event loop saturation; additional metrics to check: python_gc_collections_total, request rate.
4.2 OcppDispatcherLoopErrors (warning, 15m)
Meaning: the PR-E6 polling pattern is raising errors. The field hypothesis has been reproduced.
Triage:
- Grafana PR-E6 Dispatcher Canary > Loop Error Rate panel: which error_kind?
  - redis_connection dominant → check the Redis broker: docker compose logs redis --since 1h, redis-cli INFO clients (CLIENT LIST count, blocked_clients)
  - cancelled or unknown dominant → BUG signal. Inspect the outer Exception path of listen_commands in dispatcher.py. For a stack trace: docker compose logs backend --since 1h | grep ocpp_dispatcher_loop_error
  - timeout dominant → unexpected asyncio.TimeoutError inside _listen_commands_once. Inspect the pubsub.get_message(timeout=1.0) call plus the outer asyncio.wait_for wrapper with its extra 0.5s (if present).
- Backend uptime: when was the last restart?
- If the spike follows a new deploy, check whether the deploy commit touches dispatcher.py.
Action:
- redis_connection sustained → alternative hypothesis 1 (Redis pool): section 5.1
- cancelled / unknown → BUG, code review required. Raise it in Slack #zeus-engineering.
4.3 OcppDispatcherListenerLag (warning, 10m)
Meaning: listen tasks are below 70% of the WS connection count. Remote commands (RemoteStart/Stop/Reset) will not work for the affected chargers.
Triage:
- Grafana Aktif Listener / WS Baglanti stat panel: how large is the gap?
- Logs:
  - register() call count vs listen_commands task spawn count. Pattern: ocpp_registry_registered vs ocpp_dispatcher_listening.
  - replace_and_cancel logs: was the Gauge decremented after the old task was cancelled?
Action:
- Large gap + high register counts → the router did not spawn the listen_commands task (regression). Investigate with devops-deployment-agent.
- Gauge drops to negative values → multiple dec calls, or an inc was forgotten on one of the register paths.
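The negative-gauge failure mode comes from inc and dec calls that are not paired. One hedged way to keep them paired is to tie both to a single context manager, so that each listener increments once on entry and decrements exactly once on exit, including on cancellation. FakeGauge, track_listener and listen_commands_stub are hypothetical names for illustration; a real Prometheus gauge object would take the place of FakeGauge.

```python
import asyncio
from contextlib import contextmanager


class FakeGauge:
    """Stand-in (assumption) for a metrics gauge with inc()/dec()."""

    def __init__(self) -> None:
        self.value = 0

    def inc(self) -> None:
        self.value += 1

    def dec(self) -> None:
        self.value -= 1


@contextmanager
def track_listener(gauge):
    """Pair one inc with exactly one dec, whatever way the body exits."""
    gauge.inc()
    try:
        yield
    finally:
        gauge.dec()


async def listen_commands_stub(gauge, stop: asyncio.Event) -> None:
    """Hypothetical listener body: holds the gauge while it waits."""
    with track_listener(gauge):
        await stop.wait()


async def demo():
    gauge, stop = FakeGauge(), asyncio.Event()
    first = asyncio.create_task(listen_commands_stub(gauge, stop))
    second = asyncio.create_task(listen_commands_stub(gauge, stop))
    await asyncio.sleep(0)  # let both tasks reach their first await
    while_running = gauge.value
    first.cancel()  # replaced listener: cancelled
    stop.set()      # other listener: normal exit
    await asyncio.gather(first, second, return_exceptions=True)
    return while_running, gauge.value
```

With this shape the gauge cannot drift below zero, because there is no code path that calls dec without a preceding inc in the same scope.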
4.4 OcppBackendUptimeDrop (info, 5m)
Meaning: the backend container has under 5 minutes of uptime, i.e. an unexpected restart.
Triage:
- Is there a planned deploy? Check the GitHub Actions deploy.yml log (did CI/CD run?).
- docker compose ps: exit code, restart count.
- docker compose logs backend --since 10m | head -200: any SIGTERM, OOM, or panic?
- dmesg | tail (host): any OOMKiller message?
Action:
- OOM → memory tuning, separate triage (unrelated to PR-E6).
- SIGTERM (unplanned) → orchestrator (Compose) policy. restart: always should already be active.
- Crash + dispatcher_loop_errors spike → PR-E6 fix insufficient, alternative hypothesis 3 (worker recycle): section 5.3.
5. Alternative Hypotheses (if the PR-E6 fix is insufficient)
If 504s / restart loops still occur after 24h, work through the hypotheses below in order.
5.1 Redis pool exhaustion
The get_redis() connection pool may be filling up; each pubsub() call takes a new connection, and if it is not returned the pool runs dry.
Check:
# Redis connection count
docker compose exec redis redis-cli INFO clients
# a high blocked_clients value means pubsub is stuck
docker compose exec redis redis-cli CLIENT LIST | wc -l
# Backend side: REDIS_MAX_CONNECTIONS env, connection pool stats
Expected: connected_clients < REDIS_MAX_CONNECTIONS (typically between 50 and 100).
Action: increase the pool size and guarantee the pubsub.close() call on every path (currently done via try/finally).
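The comparison against REDIS_MAX_CONNECTIONS can be scripted by parsing the text that redis-cli INFO clients prints (key:value lines under a "# Clients" header). The helpers below are a hypothetical sketch that works on captured output, so no Redis connection is needed; the limit of 50 in the usage note is only an example value.

```python
def parse_info_clients(raw: str) -> dict:
    """Parse `redis-cli INFO clients` output into {field: int}."""
    stats = {}
    for line in raw.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and the section header
            continue
        key, _, value = line.partition(":")
        if value.isdigit():
            stats[key] = int(value)
    return stats


def pool_headroom(stats: dict, max_connections: int) -> int:
    """Connections left before connected_clients reaches the configured limit."""
    return max_connections - stats["connected_clients"]
```

For instance, pool_headroom(parse_info_clients(output), 50) returning a small or negative number over several checkpoints would support the pool exhaustion hypothesis.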
5.2 EMQX broker disconnect rate
The OCPP dispatcher uses Redis (not EMQX), but an EMQX disconnect storm in the telemetry/MQTT flow could saturate the event loop and degrade dispatcher latency.
Check: Grafana MQTT Broker Health dashboard, EMQX Management UI > Connections panel.
5.3 Gunicorn / uvicorn worker recycle
Workers may be restarting periodically because of Gunicorn --max-requests or a uvicorn worker timeout; pubsub tasks are dropped at the moment of restart.
Check:
docker compose exec backend ps aux | grep -E "gunicorn|uvicorn"
# Age of the worker pids
docker compose logs backend --since 24h | grep -iE "worker|signal|sigterm" | tail -50
Expected: worker pids should be long-lived (hours). Booting worker with pid messages should appear only at deploy time.
Action: --max-requests 0 (unlimited) or a very high value (1M+); uvicorn --timeout-graceful-shutdown 30+.
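To turn the "only at deploy time" expectation into a number, the saved log output can be scanned for the Booting worker with pid message. count_worker_boots is a hypothetical helper, and the sample lines in the usage note are illustrative, not real output from this backend.

```python
def count_worker_boots(log_text: str) -> int:
    """Count log lines that report a worker boot."""
    marker = "Booting worker with pid"
    return sum(1 for line in log_text.splitlines() if marker in line)
```

If a 24h log window with a single deploy and N configured workers yields a count well above N, workers are being recycled outside of deploys.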
5.4 redis-py asyncio iterator BUG (if the PR-E6 fix is insufficient)
Because PR-E6 already moved from pubsub.listen() to get_message(), this hypothesis is in theory covered by the fix. If the cancelled error_kind is observed anyway:
- Check the redis-py version: pip show redis (5.x expected)
- Issue tracker: github.com/redis/redis-py iterator + cancellederror
6. Test Command Template
REST endpoint for sending the 1× test command at each checkpoint:
# A test charger CPID is required: use one already registered in staging
CPID="STAGING-CP-001"
JWT_TOKEN="..." # admin token
curl -X POST "https://staging.zeus.local/api/v1/ocpp/chargers/${CPID}/reset" \
-H "Authorization: Bearer ${JWT_TOKEN}" \
-H "Content-Type: application/json" \
-d '{"type": "Soft"}' \
--max-time 30 \
-w "\nTotal: %{time_total}s\n"
Expected: HTTP 200 + {"status": "Accepted"} within < 5s. The command latency should be visible in Grafana.
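If the checkpoint command is scripted instead of run by hand, the same request and pass condition could look roughly like this. build_reset_request and checkpoint_passed are hypothetical helpers; the endpoint path, payload and thresholds are taken from the curl template and the expectation above, and sending the request (for example with urllib.request.urlopen(req, timeout=30)) is left to the caller.

```python
import json
import urllib.request


def build_reset_request(base_url: str, cpid: str, token: str) -> urllib.request.Request:
    """Build the same POST as the curl template (Soft reset)."""
    body = json.dumps({"type": "Soft"}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/api/v1/ocpp/chargers/{cpid}/reset",
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def checkpoint_passed(http_status: int, body: str, elapsed_s: float, max_s: float = 5.0) -> bool:
    """True when the reply is HTTP 200 with status Accepted in under max_s seconds."""
    try:
        status = json.loads(body).get("status")
    except (ValueError, AttributeError):  # empty, non-JSON, or non-object body
        return False
    return http_status == 200 and status == "Accepted" and elapsed_s < max_s
```

Recording the boolean per checkpoint gives a simple audit trail for the approval criteria.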
7. Canary Approval Criteria
To close the canary and declare "PR-E6 fix successful", ALL criteria must be met:
- T+24h: increase(ocpp_dispatcher_loop_errors_total[24h]) == 0
- T+24h: command p99 latency < 2s (sustained, including day/night periods)
- T+24h: backend uptime increasing monotonically (>24h, no restarts other than planned deploys)
- T+24h: 4× test commands (T+0, T+1h, T+12h, T+24h), all Accepted
- T+24h: ocpp_dispatcher_listening_active / ocpp_ws_connections_active ratio >= 0.9 (sustained)
- T+24h: no 504 reports from the field
- T+24h: no alert fired (OcppCommandLatencyP99High / OcppDispatcherLoopErrors / OcppDispatcherListenerLag / OcppBackendUptimeDrop)
If any criterion is not met, the canary is NOT closed until the root cause is found. Activate the section 5 alternative hypotheses and hand off to the relevant agent (devops-deployment-agent / backend-systems-architect / code-audit-sentinel).
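The all-or-nothing rule can be made explicit with a small checklist structure. The sketch below is illustrative only: CanaryReport and failed_criteria are hypothetical names, and the field values are assumed to be collected by hand or by scripts at T+24h.

```python
from dataclasses import dataclass


@dataclass
class CanaryReport:
    """Observed values at T+24h (hypothetical container)."""
    loop_errors_24h: float
    p99_latency_s: float
    unplanned_restarts: int
    test_commands_accepted: int
    listener_ratio: float
    field_504_reports: int
    alerts_fired: int


def failed_criteria(report: CanaryReport) -> list:
    """Return the names of unmet criteria; an empty list means the canary may close."""
    checks = {
        "loop errors == 0": report.loop_errors_24h == 0,
        "p99 latency < 2s": report.p99_latency_s < 2.0,
        "no unplanned restarts": report.unplanned_restarts == 0,
        "4 test commands Accepted": report.test_commands_accepted == 4,
        "listener ratio >= 0.9": report.listener_ratio >= 0.9,
        "no field 504 reports": report.field_504_reports == 0,
        "no alerts fired": report.alerts_fired == 0,
    }
    return [name for name, ok in checks.items() if not ok]
```

Returning the list of failed names rather than a single boolean makes the handoff concrete: the unmet criteria indicate which section 5 hypothesis to start with.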
8. Escalation Path
- Triage: monitoring-observability-architect (owns this doc; interprets alerts + dashboards)
- Code review: backend-systems-architect (dispatcher.py / commands.py)
- Audit: code-audit-sentinel (regression detection)
- Rollback: devops-deployment-agent (revert + redeploy)
- Final decision: chief-systems-orchestrator
9. References
- Backend code: backend/app/core/ocpp/dispatcher.py (listen_commands outer retry loop)
- Metric definitions: backend/app/core/observability/ocpp_metrics.py
- Alert rules: prometheus/rules/ocpp.yml (groups: ocpp, alerts 6-9 PR-E6 canary)
- Dashboard: grafana/dashboards/ocpp-fleet.json (panels 11-15: PR-E6 Dispatcher Canary row)
- Unit tests: backend/tests/features/ocpp/test_dispatcher.py
- Previous fix (PR-E1): Redis owner-registry TTL refresh
- PR-E6 PR: #292