PR-E6 Dispatcher Canary Monitoring (24h)

1. Purpose

Field issue: the dispatcher of the Zeus EMS OCPP backend starts returning 504 timeouts once backend uptime reaches several hours. A manual docker compose restart backend clears it; the working hypothesis is that a corrupt Redis pubsub session and a listen_commands restart loop trigger each other.

Hypothesis: in redis-py 5.x the async-iterator generator swallows asyncio.CancelledError → the outer retry loop falls into the generic except Exception path → infinite restart loop + corrupt pubsub session.

Fix (PR-E6): refactor the pubsub.listen() async-iterator pattern into a pubsub.get_message(timeout=1.0, ignore_subscribe_messages=True) polling pattern. With polling, CancelledError propagates correctly and the outer loop only kicks in on real connection problems.
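The polling pattern described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in dispatcher.py: the function bodies, the open_pubsub and handle_command callables, and the retry_delay parameter are hypothetical. The only call taken from this document is get_message(ignore_subscribe_messages=True, timeout=1.0), which redis-py's asyncio PubSub provides.

```python
import asyncio


async def _listen_commands_once(pubsub, handle_command):
    """Poll one pubsub session until an error or a cancellation ends it."""
    while True:
        # get_message() returns None when nothing arrives within the timeout,
        # so every iteration is a plain await that a cancel can interrupt.
        message = await pubsub.get_message(
            ignore_subscribe_messages=True, timeout=1.0
        )
        if message is not None:
            await handle_command(message)


async def listen_commands(open_pubsub, handle_command, retry_delay=1.0):
    """Outer retry loop: open a new session only after a real failure."""
    while True:
        pubsub = await open_pubsub()  # hypothetical factory for a subscribed pubsub
        try:
            await _listen_commands_once(pubsub, handle_command)
        except asyncio.CancelledError:
            # Shutdown or task replacement: propagate, never retry.
            raise
        except Exception:
            # Connection-level problem: back off, then loop to a new session.
            await asyncio.sleep(retry_delay)
        finally:
            # Release the connection on every path, including cancellation.
            await pubsub.close()
```

The explicit except asyncio.CancelledError branch is redundant on Python 3.8+, where CancelledError no longer derives from Exception, but it documents the intent: a cancel must never reach the retry path.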

Previous fix (PR-E1): Redis owner-registry TTL refresh. The heartbeat handler extends the TTL on every device heartbeat (zombie cleanup + owner consistency).

Unit tests could not reproduce the issue, so a 24-hour production canary monitoring window is mandatory. If the fix does not help, the alternative hypotheses (Redis pool exhaustion, EMQX broker, gunicorn worker recycle) must be investigated.

2. New Metrics (PR-E6 canary)

| Metric | Type | Label | Meaning |
|---|---|---|---|
| ocpp_dispatcher_loop_errors_total | Counter | error_kind (5 values) | Error counter of the listen_commands outer retry loop. Must stay at 0 for the full 24h. |
| ocpp_dispatcher_listening_active | Gauge | none | Number of active listen_commands tasks. Should be close to ocpp_ws_connections_active. |
| ocpp_command_timeouts_total{timeout_kind="pubsub"} | Counter (existing) | action, timeout_kind | Caller-side "reply never arrived" case. After the PR-E6 fix it should increase only when a charger is really offline. |

error_kind whitelist (cardinality control, max 5):

  • redis_connection: RedisConnectionError / ConnectionError
  • os_error: OSError (socket/dns)
  • timeout: asyncio.TimeoutError (unexpected path)
  • cancelled: CancelledError leaked into the generic loop (BUG signal, must never happen)
  • unknown: any generic Exception that matches none of the above
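A minimal sketch of how an exception could be mapped onto this whitelist. The classify_error function and its redis_connection_types parameter are hypothetical names, not taken from ocpp_metrics.py; a caller using redis-py would presumably pass (redis.exceptions.ConnectionError,), since that class does not derive from the builtin ConnectionError.

```python
import asyncio

# The five allowed label values; anything else would raise label cardinality.
ERROR_KINDS = ("redis_connection", "os_error", "timeout", "cancelled", "unknown")


def classify_error(exc, redis_connection_types=()):
    """Map an exception instance to one of the whitelisted error_kind values."""
    if isinstance(exc, asyncio.CancelledError):
        return "cancelled"
    if isinstance(exc, (ConnectionError, *redis_connection_types)):
        return "redis_connection"
    # Checked before OSError: since Python 3.11 asyncio.TimeoutError is the
    # builtin TimeoutError, which is itself an OSError subclass.
    if isinstance(exc, asyncio.TimeoutError):
        return "timeout"
    if isinstance(exc, OSError):
        return "os_error"
    return "unknown"
```

The order of the checks matters because the builtin ConnectionError and TimeoutError are both subclasses of OSError; testing OSError first would fold them into os_error.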

3. 24h Checkpoint Table

After the canary deploy, verify with docker compose ps that the backend is up/healthy, then open the Grafana row OCPP Charger Fleet > PR-E6 Dispatcher Canary at the times below.

| Time | Check | Expected | Action (on deviation) |
|---|---|---|---|
| T+0 (after deploy) | Backend uptime panel | Fresh start (on the order of seconds) | |
| T+0 | 1× test command: RemoteStartTransaction or Reset | Accepted/Rejected within < 5s | Sustained timeout → rollback. Logs: ocpp_dispatcher_pubsub_timeout |
| T+0 | ocpp_dispatcher_loop_errors_total (5m rate) | 0 (for every error_kind) | redis_connection > 0 → check the broker; cancelled/unknown > 0 → BUG, inspect dispatcher.py |
| T+0 | Aktif Listener / WS Baglanti (active listeners / WS connections) panel | The two numbers are close (ratio ≥ 0.9) | Large gap → register/spawn lifecycle is broken |
| T+1h | increase(ocpp_command_timeouts_total{timeout_kind="pubsub"}[1h]) | 0 (unless a charger is really offline) | > 0 while the charger is online → fix insufficient |
| T+1h | Command latency p99 panel | < 2s (steady state) | > 5s sustained → Alert: OcppCommandLatencyP99High |
| T+6h | 1× test command + uptime > 6h | Command Accepted, uptime increasing monotonically | Restart happened → unexpected crash, Alert: OcppBackendUptimeDrop |
| T+12h | 1× test command (old symptom window) | Command Accepted in < 5s | 504 → fix insufficient, move to the alternative hypotheses (section 5) |
| T+12h | increase(ocpp_dispatcher_loop_errors_total[12h]) | 0 | > 0 → error_kind breakdown (Grafana panel #11) |
| T+24h | All panels green | dispatcher_loop_errors cumulative still 0, pubsub timeouts only for chargers that are really offline | Clear → fix successful, close the canary. Otherwise → go to section 5 |
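The PromQL checks in the table can also be run outside Grafana through the Prometheus HTTP API instant-query endpoint (/api/v1/query). The sketch below only builds the URL and sums a returned vector; the function names and the base URL used in the usage note are assumptions, not part of this deployment.

```python
from urllib.parse import urlencode


def instant_query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def sum_vector(payload):
    """Sum all sample values of an instant-vector response.

    Each sample has the shape {"metric": {...}, "value": [timestamp, "number"]}.
    An empty result (no series at all) sums to 0.
    """
    if payload.get("status") != "success":
        raise ValueError(f"Prometheus query failed: {payload.get('error')}")
    return sum(float(sample["value"][1]) for sample in payload["data"]["result"])
```

For the T+12h row, one would fetch instant_query_url("http://prometheus:9090", "increase(ocpp_dispatcher_loop_errors_total[12h])") (hypothetical host), decode the JSON body, and expect sum_vector(...) to be 0 across all error_kind series.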

4. Alert Signals and Actions

Full definitions of the Prometheus alerts: prometheus/rules/ocpp.yml (group: ocpp).

4.1 OcppCommandLatencyP99High (warning, 10m)

Meaning: Command p99 latency > 5s for 10 minutes. A signal of degradation after hours of uptime, which is exactly how the field hypothesis would manifest.

Triage:

  1. Open the Grafana PR-E6 Dispatcher Canary > Komut Latency p50/p95/p99 (command latency) panel. When did p99 start to trigger?
  2. Does the Loop Error Rate panel show errors in the same time window? If so, look at the error_kind breakdown.
  3. Backend uptime panel: is there a restart loop?
  4. Logs: docker compose logs backend --since 30m | grep ocpp_dispatcher

Action:

  • Sustained (> 30 min) and dispatcher_loop_errors > 0 → consider a rollback. Notify Slack #zeus-incidents and revert to main HEAD in coordination with devops-deployment-agent.
  • Latency high but loop_errors == 0 → backend event loop saturation; additional metrics to check: python_gc_collections_total, request rate.

4.2 OcppDispatcherLoopErrors (warning, 15m)

Meaning: The PR-E6 polling pattern is raising errors. The field hypothesis has been reproduced.

Triage:

  1. Grafana PR-E6 Dispatcher Canary > Loop Error Rate panel: which error_kind?
    • redis_connection dominant → check the Redis broker: docker compose logs redis --since 1h, redis-cli INFO clients (CLIENT LIST count, blocked_clients)
    • cancelled or unknown dominant → BUG signal. Inspect the outer Exception path of listen_commands in dispatcher.py. For the stack trace: docker compose logs backend --since 1h | grep ocpp_dispatcher_loop_error
    • timeout dominant → unexpected asyncio.TimeoutError inside _listen_commands_once. Inspect the pubsub.get_message(timeout=1.0) call and the +0.5s outer asyncio.wait_for wrapper (if any).
  2. Backend uptime: when was the last restart?
  3. If the spike follows a new deploy, check whether the deployed commit touches dispatcher.py.

Action:

  • redis_connection sustained → alternative hypothesis 1 (Redis pool): section 5.1
  • cancelled / unknown → BUG, code review required. Raise it in Slack #zeus-engineering.

4.3 OcppDispatcherListenerLag (warning, 10m)

Meaning: The number of listen tasks is below 70% of the WS connection count. Remote commands (RemoteStart/Stop/Reset) will not work for the chargers that have no listener.

Triage:

  1. Grafana Aktif Listener / WS Baglanti (active listeners / WS connections) stat panel: how large is the gap?
  2. Logs: number of register() calls vs. number of listen_commands task spawns. Pattern: ocpp_registry_registered vs ocpp_dispatcher_listening.
  3. replace_and_cancel logs: was the Gauge decremented after the old task was cancelled?

Action:

  • Large gap + high register counts → the router is not spawning the listen_commands task (regression). Investigate with devops-deployment-agent.
  • Gauge drops below zero → dec is called more than once, or inc is missing on one of the register paths.
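One way to keep the listener gauge balanced is to pair inc and dec inside a single try/finally that wraps the task body, so a cancel from replace_and_cancel still decrements exactly once. The sketch below uses a stand-in StubGauge so it runs without dependencies; a prometheus_client Gauge exposes the same inc()/dec() methods. tracked_listener and LISTENING_ACTIVE are hypothetical names, not the ones in the codebase.

```python
import asyncio


class StubGauge:
    """Dependency-free stand-in with the inc()/dec() shape of a Gauge."""

    def __init__(self):
        self.value = 0

    def inc(self):
        self.value += 1

    def dec(self):
        self.value -= 1


LISTENING_ACTIVE = StubGauge()  # stands in for ocpp_dispatcher_listening_active


async def tracked_listener(listen):
    """Run a listener coroutine function with exactly one inc and one dec."""
    LISTENING_ACTIVE.inc()
    try:
        await listen()
    finally:
        # Runs on normal exit, on error, and on cancellation alike, so the
        # gauge cannot drift upward or be decremented twice by other paths.
        LISTENING_ACTIVE.dec()
```

Keeping both calls in one wrapper means no register path or cancel path has to touch the gauge itself, which is the kind of mismatch the negative-gauge symptom points to.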

4.4 OcppBackendUptimeDrop (info, 5m)

Meaning: The backend container has less than 5 minutes of uptime, i.e. an unexpected restart.

Triage:

  1. Was there a planned deploy? Check the GitHub Actions deploy.yml log (did CI/CD run?).
  2. docker compose ps: exit code, restart count.
  3. docker compose logs backend --since 10m | head -200: any SIGTERM, OOM, or panic?
  4. dmesg | tail (host): any OOM killer message?

Action:

  • OOM → memory tuning, separate triage (not related to PR-E6).
  • Unplanned SIGTERM → check the orchestrator (Compose) restart policy; restart: always should already be active.
  • Crash + dispatcher_loop_errors spike → PR-E6 fix insufficient, alternative hypothesis 3 (worker recycle): section 5.3.

5. Alternative Hypotheses (if the PR-E6 fix is insufficient)

If 504s or restart loops are still occurring at the end of the 24h window, work through the hypotheses below in order.

5.1 Redis pool exhaustion

The get_redis() connection pool may be filling up; every pubsub() call takes a new connection, and if connections are not returned the pool is exhausted.

Check:

# Redis connection count; a high blocked_clients value may indicate a stuck pubsub
docker compose exec redis redis-cli INFO clients
# Number of client connections (one line per client)
docker compose exec redis redis-cli CLIENT LIST | wc -l
# Backend side: REDIS_MAX_CONNECTIONS env, connection pool stats

Expected: connected_clients < REDIS_MAX_CONNECTIONS (typically between 50 and 100).

Action: Increase the pool size and guarantee that pubsub.close() is called on every path (currently done with try/finally).
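To track the connection count over the canary window rather than reading it by eye, the INFO clients output can be parsed into numbers. This is a small sketch; the function names are hypothetical and the sample text in the usage note is illustrative, not captured from this system. Note that connected_clients counts every client of the Redis server, not only the backend pool, so the headroom it yields is a lower bound.

```python
def parse_info_clients(raw):
    """Parse `redis-cli INFO clients` text into a dict of integer fields."""
    stats = {}
    for line in raw.splitlines():
        line = line.strip()
        # Skip blanks and the "# Clients" section header.
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        if value.isdigit():
            stats[key] = int(value)
    return stats


def pool_headroom(stats, max_connections):
    """Connections left before connected_clients reaches the configured limit."""
    return max_connections - stats["connected_clients"]
```

For example, parse_info_clients("# Clients\r\nconnected_clients:42\r\nblocked_clients:0\r\n") yields {"connected_clients": 42, "blocked_clients": 0}, and with a limit of 50 the headroom is 8.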

5.2 EMQX broker disconnect rate

The OCPP dispatcher uses Redis (not EMQX), but an EMQX disconnect storm in the telemetry/MQTT flow can saturate the event loop and degrade dispatcher latency.

Check: Grafana MQTT Broker Health dashboard, EMQX Management UI > Connections panel.

5.3 Gunicorn / uvicorn worker recycle

Workers may be restarting periodically because of Gunicorn --max-requests or a uvicorn worker timeout; the pubsub tasks are dropped at the moment of restart.

Check:

# Worker processes and their age (START column)
docker compose exec backend ps aux | grep -E "gunicorn|uvicorn"
# Worker lifecycle events in the last 24h
docker compose logs backend --since 24h | grep -iE "worker|signal|sigterm" | tail -50

Expected: Worker pids should be long-lived (hours). Booting worker with pid messages should appear only at deploy time.

Action: Set --max-requests 0 (unlimited) or a very high value (1M+); set uvicorn --timeout-graceful-shutdown to 30 or more.
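Counting the Booting worker with pid messages gives a quick number for how often workers were recycled. The sketch below assumes the log text was saved from the docker compose logs command above; the function names and the expected_workers parameter are hypothetical.

```python
import re

# Matches the worker boot message whether or not a colon follows "pid".
BOOT_RE = re.compile(r"Booting worker with pid:?\s*(\d+)")


def worker_boot_pids(log_text):
    """Return the pid of every worker boot message, in log order."""
    return [int(match.group(1)) for match in BOOT_RE.finditer(log_text)]


def extra_boots(log_text, expected_workers):
    """Boots beyond the initial worker set; more than 0 suggests recycling."""
    return max(0, len(worker_boot_pids(log_text)) - expected_workers)
```

With no deploy inside the log window, extra_boots should stay at 0 for the configured worker count; anything higher points at worker recycling.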

5.4 redis-py asyncio iterator BUG (if the PR-E6 fix is insufficient)

Because PR-E6 already moved from pubsub.listen() to get_message(), this hypothesis is in theory covered by the fix. If the cancelled error_kind is observed anyway:

  • Check the redis-py version: pip show redis (5.x expected)
  • Issue tracker: search github.com/redis/redis-py for iterator + cancellederror

6. Test Command Template

REST endpoint for sending the 1× test command at each checkpoint:

# A test charger CPID is required: use one already registered in staging
CPID="STAGING-CP-001"
JWT_TOKEN="..." # admin token

curl -X POST "https://staging.zeus.local/api/v1/ocpp/chargers/${CPID}/reset" \
-H "Authorization: Bearer ${JWT_TOKEN}" \
-H "Content-Type: application/json" \
-d '{"type": "Soft"}' \
--max-time 30 \
-w "\nTotal: %{time_total}s\n"

Expected: HTTP 200 + {"status": "Accepted"} in < 5s. The command latency should be visible in Grafana.
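For scripted checkpoints, the same request and the pass condition can be sketched with the Python standard library. The helper names are hypothetical; the URL path, headers, body, 30s cap, and 5s limit mirror the curl template above.

```python
import json
import time
import urllib.request


def build_reset_request(base_url, cpid, token, reset_type="Soft"):
    """Build the POST request that the curl template sends."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/v1/ocpp/chargers/{cpid}/reset",
        data=json.dumps({"type": reset_type}).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def checkpoint_passed(http_status, body, elapsed_s, limit_s=5.0):
    """Pass condition: HTTP 200, status Accepted, answered within the limit."""
    return (
        http_status == 200
        and body.get("status") == "Accepted"
        and elapsed_s < limit_s
    )


def run_checkpoint(request, timeout_s=30.0):
    """Send the request and evaluate it.

    urlopen raises urllib.error.HTTPError on a 504, which the caller should
    treat as a failed checkpoint.
    """
    started = time.monotonic()
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        status = response.status
        body = json.loads(response.read())
    return checkpoint_passed(status, body, time.monotonic() - started)
```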

7. Canary Sign-off Criteria

To close the canary and declare "PR-E6 fix successful", ALL of the following criteria must be met:

  • T+24h: increase(ocpp_dispatcher_loop_errors_total[24h]) == 0
  • T+24h: Command p99 latency < 2s (sustained, across both day and night periods)
  • T+24h: Backend uptime increases monotonically (> 24h, no restart other than a planned deploy)
  • T+24h: 4× test commands (T+0, T+1h, T+12h, T+24h) all Accepted
  • T+24h: ocpp_dispatcher_listening_active / ocpp_ws_connections_active ratio >= 0.9 (sustained)
  • T+24h: No 504 reports from the field
  • T+24h: No alert has fired (OcppCommandLatencyP99High / OcppDispatcherLoopErrors / OcppDispatcherListenerLag / OcppBackendUptimeDrop)

If any criterion is not met, the canary is NOT closed until the root cause is found. Bring in the alternative hypotheses from section 5 and hand off to the relevant agent (devops-deployment-agent / backend-systems-architect / code-audit-sentinel).
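The all-or-nothing rule above can be written down as a small checklist evaluator, which makes it explicit which criterion blocked the sign-off. This is a sketch only: the CanaryObservation fields and the criterion names are hypothetical, and the numbers would come from the Grafana panels and test-command results collected during the 24h window.

```python
from dataclasses import dataclass


@dataclass
class CanaryObservation:
    """Values gathered at T+24h (hypothetical field names)."""

    loop_errors_increase_24h: float
    command_p99_seconds: float
    unplanned_restarts: int
    test_commands_accepted: int
    listener_ratio: float
    field_504_reports: int
    alerts_fired: int


def failed_criteria(obs, required_test_commands=4):
    """Return the names of the unmet criteria; an empty list means sign-off."""
    checks = {
        "loop_errors_zero": obs.loop_errors_increase_24h == 0,
        "p99_under_2s": obs.command_p99_seconds < 2.0,
        "no_unplanned_restart": obs.unplanned_restarts == 0,
        "all_test_commands_accepted": obs.test_commands_accepted
        >= required_test_commands,
        "listener_ratio_ok": obs.listener_ratio >= 0.9,
        "no_field_504": obs.field_504_reports == 0,
        "no_alerts": obs.alerts_fired == 0,
    }
    return [name for name, ok in checks.items() if not ok]
```

Returning the list of failed names, rather than a single boolean, gives the handoff to the next agent a concrete starting point.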

8. Escalation Path

  1. Triage: monitoring-observability-architect (uses this doc; interprets alerts and dashboards)
  2. Code review: backend-systems-architect (dispatcher.py / commands.py)
  3. Audit: code-audit-sentinel (regression detection)
  4. Rollback: devops-deployment-agent (revert + redeploy)
  5. Final decision: chief-systems-orchestrator

9. References

  • Backend code: backend/app/core/ocpp/dispatcher.py (listen_commands outer retry loop)
  • Metric definitions: backend/app/core/observability/ocpp_metrics.py
  • Alert rules: prometheus/rules/ocpp.yml (group: ocpp; alerts 6-9 are the PR-E6 canary)
  • Dashboard: grafana/dashboards/ocpp-fleet.json (panels 11-15: PR-E6 Dispatcher Canary row)
  • Unit tests: backend/tests/features/ocpp/test_dispatcher.py
  • Previous fix (PR-E1): Redis owner-registry TTL refresh
  • PR-E6 PR: #292