OCPP Charger Liveness Middleware Runbook

PR-H/4-fix (2026-05-21): Watchdog Architecture Change

The legacy heartbeat_watchdog Celery task is deprecated. Heartbeat/inactivity coverage is now provided by the PR-A1 (ocpp_charger_offline, immediate high on WebSocket disconnect) and PR-A2 (ocpp_charger_offline_extended, critical at 1h+) domain evaluators. References to heartbeat_watchdog in this runbook are kept as historical context; current operational coverage is tracked through the PR-A1/A2 domain tasks (ocpp_domain.check_charger_offline_extended). Details: OCPP Changelog: PR-H/4-fix.

Owner: backend on-call (backup: backend-systems-architect)
Related dashboard: Grafana > OCPP > OCPP Charger Liveness (uid: ocpp-charger-liveness)
Related alert file: prometheus/rules/ocpp_charger_liveness.yml
Backend source:

  • Middleware: backend/app/core/ocpp/v16/charge_point.py::_touch_last_seen (commit 97b3afe)
  • Watchdog: backend/app/tasks/ocpp_tasks.py::heartbeat_watchdog (DEPRECATED in PR-H/4-fix, no-op stub)
  • Domain evaluator (new): backend/app/features/alarms/ocpp_evaluators.py::evaluate_charger_offline_extended (PR-A2, 1h+ heartbeat threshold, critical)
  • Gauge refresh: backend/app/tasks/ocpp_tasks.py::refresh_charger_last_seen_lag
  • Metric definitions: backend/app/core/observability/ocpp_metrics.py

Migration reference: 0043 (last_seen_at TIMESTAMPTZ NULL + ix_ocpp_chargers_last_seen_partial) + 0055 (PR-H/4-fix double-prefix cleanup)
Last updated: 2026-05-21 (PR-H/4-fix watchdog deprecation + field bug cleanup)


0. Why Does This Runbook Exist?

Production motivation: Before PR-B, heartbeat_watchdog looked ONLY at last_heartbeat_at. Chargers such as the Beny BCP-2ATN-P, which have a 300s default Heartbeat but send MeterValues every 5-30s while actively charging, were incorrectly marked offline.

Solution (PR-B C5):

  • Migration 0043: ocpp_chargers.last_seen_at TIMESTAMPTZ NULL + partial index.
  • ZeusChargePoint16.route_message override: on every inbound CALL, UPDATE ocpp_chargers SET last_seen_at = now().
  • heartbeat_watchdog decides based on COALESCE(last_seen_at, last_heartbeat_at).
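
The COALESCE-based decision can be illustrated with a minimal Python sketch. It is not the watchdog's actual code: the function names, the threshold parameter, and the handling of a charger that has never been seen are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def effective_last_seen(
    last_seen_at: Optional[datetime],
    last_heartbeat_at: Optional[datetime],
) -> Optional[datetime]:
    """Python equivalent of COALESCE(last_seen_at, last_heartbeat_at)."""
    return last_seen_at if last_seen_at is not None else last_heartbeat_at


def is_stale(
    last_seen_at: Optional[datetime],
    last_heartbeat_at: Optional[datetime],
    now: datetime,
    threshold: timedelta,  # hypothetical parameter, not taken from the codebase
) -> bool:
    """True when the charger has shown no activity within the threshold."""
    seen = effective_last_seen(last_seen_at, last_heartbeat_at)
    if seen is None:
        # Never seen at all: treated as stale in this sketch (an assumption).
        return True
    return now - seen > threshold


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    heartbeat = now - timedelta(seconds=290)    # last Heartbeat almost 5 minutes ago
    meter_values = now - timedelta(seconds=10)  # MeterValues touched last_seen_at
    print(is_stale(None, heartbeat, now, timedelta(seconds=120)))
    print(is_stale(meter_values, heartbeat, now, timedelta(seconds=120)))
```

With only last_heartbeat_at available, a charger that is actively sending MeterValues looks stale; once last_seen_at is populated by the middleware, the same charger is considered alive.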

What does this runbook cover? When the middleware UPDATEs fail (exhausted DB pool, network outage, schema drift), last_seen_at is not updated and the watchdog again marks chargers offline incorrectly, so the very problem PR-B was meant to solve resurfaces. PR-B C8 observability closes this gap.


1. Triggering Alerts

Alerts that route to this runbook:

| Alert | Severity | for | Threshold | Meaning |
|---|---|---|---|---|
| OcppMiddlewareTouchErrorRateHigh | critical | 5m | rate > 0.1/s | Systemic FAIL of the middleware UPDATE. |
| OcppMiddlewareTouchSlow | warning | 10m | p95 > 100ms | DB pool contention / Redis debounce signal. |
| OcppChargerLastSeenLagHigh | warning | 5m | tenant lag > 600s | Fleet health indicator; the watchdog will mark chargers offline. |
| OcppMiddlewareTouchSuccessRateLow | info | 15m | < 95% | Low volume + high error rate; passive monitoring. |

All alerts are tagged with the labels component=ocpp and slo=ocpp_liveness_middleware. Routing: critical -> page, warning -> ticket, info -> Slack thread.


2. Quick Diagnosis (5 minutes)

Step 1: Open the dashboard

Grafana > OCPP folder > OCPP Charger Liveness:

  • URL: https://grafana.<host>/d/ocpp-charger-liveness
  • Default time range: last 3 hours

Step 2: Check the 4 panels in order

  1. Panel 1 (Touch Success Rate): >0.99 is normal. <0.95 indicates a systemic problem.
  2. Panel 2 (Touch Fail Rate): if >0.1/s, P1 (OcppMiddlewareTouchErrorRateHigh) is firing.
  3. Panel 3 (Touch p95 Latency): if >100ms, P2 (OcppMiddlewareTouchSlow) is firing.
  4. Panel 4 (Fleet Max Lag): if >600s, P3 (OcppChargerLastSeenLagHigh) is firing.

Step 3: Cross-check with logs (rate-limited!)

Middleware failure logs are limited to one entry per charger per minute (_TOUCH_FAIL_LOG_INTERVAL_SEC = 60.0). The real failure rate is therefore in the metric; the logs only show a sample.

# Middleware errors in the last 10 minutes (rate-limited sample)
docker compose logs backend --since 10m 2>&1 | grep ocpp_last_seen_touch_failed | head -20

# Per-charger breakdown (cardinality-safe: derived from logs)
# --no-log-prefix drops the service-name prefix so that jq receives plain JSON lines
docker compose logs --no-log-prefix --since 30m backend 2>&1 \
| grep ocpp_last_seen_touch_failed \
| jq -r '.charger_id // .extra.charger_id' \
| sort | uniq -c | sort -rn | head -10

# Action breakdown: during which CALL type do the failures occur?
docker compose logs --no-log-prefix --since 30m backend 2>&1 \
| grep ocpp_last_seen_touch_failed \
| jq -r '.action // .extra.action' \
| sort | uniq -c | sort -rn
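
Where jq is not available, the same per-charger or per-action breakdown can be approximated in Python. This is a sketch under assumptions: it presumes each failure is logged as one JSON object per line with charger_id and action either at the top level or under "extra" (as the jq filters above imply), and the function name is hypothetical.

```python
import json
import sys
from collections import Counter
from typing import Iterable


def count_by_field(lines: Iterable[str], field: str) -> Counter:
    """Count ocpp_last_seen_touch_failed events per value of `field`.

    Looks at the top level first, then under "extra", like the jq filters.
    """
    counts: Counter = Counter()
    for line in lines:
        if "ocpp_last_seen_touch_failed" not in line:
            continue
        # Skip a possible "service | " prefix in front of the JSON payload.
        start = line.find("{")
        if start == -1:
            continue
        try:
            event = json.loads(line[start:])
        except json.JSONDecodeError:
            continue
        if not isinstance(event, dict):
            continue
        extra = event.get("extra")
        value = event.get(field)
        if value is None and isinstance(extra, dict):
            value = extra.get(field)
        if value is not None:
            counts[str(value)] += 1
    return counts


if __name__ == "__main__":
    # Example: docker compose logs --since 30m backend 2>&1 | python breakdown.py
    for value, n in count_by_field(sys.stdin, "charger_id").most_common(10):
        print(f"{n:6d} {value}")
```

Because the logs are rate-limited, the counts indicate which chargers or actions are affected, not how often the failure actually occurs.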

Step 4: Pattern decision

| Observation | Hypothesis | Action |
|---|---|---|
| P1 firing + DB pool util ~1.0 | Pool exhausted | §Scenario 1 |
| P1 firing + DB connection errors in logs | Postgres down / network | §Scenario 3 |
| P1 firing + last_seen_at column missing | Schema drift (migration 0043 did not run) | §Scenario 2 |
| P2 firing + DB pool util > 50% sustained | Redis debounce decision | §Scenario 5 |
| P3 firing + P1 firing | Middleware FAIL -> lag growing | Resolve P1 first |
| P3 firing + P1 normal | Charger fleet connectivity | §Scenario 4 |
| P4 firing alone | Low volume + errors | Passive monitoring |
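
The table can also be read as a small decision function. The sketch below is illustrative only: the function name, the 0.95 cutoff standing in for "~1.0", and the order in which the P1 hypotheses are checked are assumptions, since the table does not define a priority between them.

```python
from typing import Set


def triage(
    firing: Set[str],
    *,
    pool_util: float = 0.0,
    db_connection_errors: bool = False,
    last_seen_column_missing: bool = False,
) -> str:
    """Map firing alerts (P1..P4) plus observations to the next action."""
    if "P1" in firing:
        # P1 always comes first, even when P3 is firing as well.
        if last_seen_column_missing:
            return "Scenario 2"
        if db_connection_errors:
            return "Scenario 3"
        if pool_util >= 0.95:  # assumed cutoff for "~1.0"
            return "Scenario 1"
        return "unclassified"
    if "P2" in firing and pool_util > 0.5:
        return "Scenario 5"
    if "P3" in firing:
        return "Scenario 4"
    if firing == {"P4"}:
        return "Passive monitoring"
    return "unclassified"


if __name__ == "__main__":
    print(triage({"P1", "P3"}, pool_util=0.99))
    print(triage({"P3"}))
```

Combinations the table does not cover return "unclassified" rather than guessing a scenario.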

3. Escalation Matrix

| Level | Who | When | Contact |
|---|---|---|---|
| L1 | Backend on-call | Immediately | Slack #zeus-incidents + PagerDuty |
| L2 | backend-systems-architect | If L1 cannot resolve within 30 minutes | Slack DM |
| L3 | CTO | If L2 cannot resolve within 1 hour, or impact > 50 chargers | Phone |

P1 critical (OcppMiddlewareTouchErrorRateHigh):

  • If not resolved within 30 minutes -> CTO escalation.
  • If it is migration drift (§Scenario 2): run alembic upgrade head in coordination with devops-deployment-agent.

P2/P3 warning:

  • Backend on-call ticket -> triage within 4 hours.
  • P3 together with P1: resolve P1 first. P3 alone: track the trend.

4. Common Scenarios

Scenario 1: DB connection pool exhaustion

Symptoms:

  • OcppMiddlewareTouchErrorRateHigh firing
  • Grafana Panel 9 (DB Pool Utilization) at ~100%
  • Logs: QueuePool limit of size N overflow M reached, or asyncpg.PoolTimeoutError

Diagnosis:

# DB pool state
curl -s http://localhost:8000/metrics | grep zeus_db_pool_

# Active Postgres connections
docker compose exec db psql -U zeus -c "SELECT count(*), state \
FROM pg_stat_activity WHERE datname='zeus' GROUP BY state;"

Mitigation:

  1. Short term: Restarting the backend resets the pool (about 5 minutes to recover):
    docker compose restart backend
  2. Medium term: Increase DATABASE_POOL_SIZE (default 10) and DATABASE_MAX_OVERFLOW (default 20).
  3. Long term: Add a Redis-based 30s debounce (§Scenario 5).

Prevention: If zeus_db_pool_utilization (Panel 9) stays above 0.7, grow the pool size proactively.
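
As a rough illustration of the sizing arithmetic, the sketch below derives the pool ceiling from the two settings and applies the 0.7 rule. It assumes utilization is defined as checked-out connections divided by pool size plus overflow; how zeus_db_pool_utilization is actually computed is not stated here, and both function names are hypothetical.

```python
import os
from typing import Mapping


def pool_capacity(env: Mapping[str, str] = os.environ) -> int:
    """Maximum concurrent connections: pool size plus overflow."""
    size = int(env.get("DATABASE_POOL_SIZE", "10"))          # default from the runbook
    overflow = int(env.get("DATABASE_MAX_OVERFLOW", "20"))   # default from the runbook
    return size + overflow


def should_grow_pool(checked_out: int, capacity: int, threshold: float = 0.7) -> bool:
    """True when utilization exceeds the proactive threshold."""
    return capacity > 0 and checked_out / capacity > threshold


if __name__ == "__main__":
    capacity = pool_capacity()
    print(capacity, should_grow_pool(25, capacity))
```

With the defaults the ceiling is 30 connections, so a sustained level above roughly 21 checked-out connections would cross the 0.7 mark under this definition.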


Scenario 2: Migration drift (last_seen_at column missing)

Symptoms:

  • OcppMiddlewareTouchErrorRateHigh firing
  • Logs: asyncpg.UndefinedColumnError: column "last_seen_at" does not exist
  • All chargers are affected (broad blast radius)

Diagnosis:

# Current migration state
docker compose exec backend alembic current

# Expected: 0043_xxx (last_seen_at + partial index) or later
# If it shows 0042 or earlier, the migration has not run

Mitigation:

# Apply the migration (production-safe: additive)
docker compose exec backend alembic upgrade head

# Verify
docker compose exec backend alembic current
docker compose exec db psql -U zeus -c \
"\d ocpp_chargers" | grep last_seen_at

Prevention: The deploy pipeline must include an alembic upgrade head step. Gate CI on alembic check (SQL drift).
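
The revision check from the diagnosis step could be automated, for example as a deploy gate. The sketch below parses the text printed by alembic current; it assumes revision identifiers start with a four-digit number (as in 0043_xxx), and the function names are hypothetical.

```python
import re
from typing import Optional

REQUIRED_REVISION = 43  # migration 0043 adds last_seen_at


def revision_number(alembic_current_output: str) -> Optional[int]:
    """Extract the numeric prefix from output such as '0043_xxx (head)'."""
    for line in alembic_current_output.splitlines():
        # Log lines such as "INFO  [alembic...]" do not start with digits.
        match = re.match(r"\s*(\d{4})(?!\d)", line)
        if match:
            return int(match.group(1))
    return None


def has_last_seen_column(alembic_current_output: str) -> bool:
    """True when the database is at migration 0043 or later."""
    number = revision_number(alembic_current_output)
    return number is not None and number >= REQUIRED_REVISION


if __name__ == "__main__":
    print(has_last_seen_column("0042_xxx"))
    print(has_last_seen_column("0055_xxx (head)"))
```

An empty or unparseable output is treated as "not migrated", which errs on the side of blocking the deploy.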


Scenario 3: PostgreSQL down / network outage

Symptoms:

  • OcppMiddlewareTouchErrorRateHigh firing
  • Logs: asyncpg.ConnectionDoesNotExistError, OSError: connection refused
  • Other DB-dependent metrics also degrade (zeus_db_pool_* is 0 or missing)

Diagnosis:

# Postgres container health
docker compose ps db
docker compose logs db --tail 50

# Network reachability
docker compose exec backend nc -zv db 5432

Mitigation:

# Restart Postgres
docker compose restart db

# Backend reconnect (pool refresh)
docker compose restart backend

Prevention: Postgres health probe (pg_isready) in the docker-compose healthcheck block; restart: unless-stopped policy.
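
If nc is not installed in the backend image, the reachability check from the diagnosis step can be approximated with the Python standard library. This sketch only confirms that the TCP port accepts connections; unlike pg_isready it does not verify that Postgres is ready to serve queries. The function name is hypothetical.

```python
import socket


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds (like nc -z)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


if __name__ == "__main__":
    # "db" is the compose service name used elsewhere in this runbook.
    print(port_open("db", 5432))
```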


Scenario 4: Charger fleet connectivity problem

Symptoms:

  • OcppChargerLastSeenLagHigh firing (P3)
  • P1 (OcppMiddlewareTouchErrorRateHigh) NORMAL: the middleware is working
  • A single tenant or a single charger group is affected
  • The Panel 7 (Lag by Tenant) bar chart points at one tenant

Diagnosis:

# Chargers in the tenant ordered by last activity
docker compose exec db psql -U zeus -c \
"SELECT cpid, last_seen_at, last_heartbeat_at, status, \
EXTRACT(EPOCH FROM (now() - last_seen_at))::int AS lag_sec \
FROM ocpp_chargers \
WHERE tenant_id = '<TENANT_ID>' \
ORDER BY last_seen_at NULLS FIRST LIMIT 10;"

# Active WS connections vs. total chargers
curl -s http://localhost:8000/metrics | grep -E "ocpp_ws_connections_active|zeus_ocpp_chargers_total"

Mitigation:

  • Notify the vendor integration team (per-charger connectivity issue).
  • If there is physical access to the charger: check the 4G modem / WiFi signal.
  • At the network level: check the nginx /ocpp/ access log for traffic from the charger's IP.

Prevention: Per-charger health check (MQTT LWT); this is outside the OCPP spec and vendor-specific.


Scenario 5: High MeterValues frequency -> Redis debounce decision

Symptoms:

  • OcppMiddlewareTouchSlow firing (P2)
  • Panel 6 (Latency Percentiles) p95 > 100ms sustained for 24h+
  • Panel 9 (DB Pool) utilization at 50%+ sustained
  • The fleet is dominated by Beny or other vendors with a high MeterValues frequency

Diagnosis:

# MeterValues message rate / active chargers
curl -s http://localhost:8000/metrics | grep -E \
"ocpp_messages_total{action=\"MeterValues\"|ocpp_ws_connections_active"

Calculation: sum(rate(ocpp_messages_total{action="MeterValues"}[5m])) / sum(ocpp_ws_connections_active). With 30 chargers, a total rate of ~6/s gives 6 / 30 = 0.2 messages/s per charger, i.e. each charger sends MeterValues every 5s.
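
The same arithmetic as a small helper, useful for plugging in other fleet sizes. The function name is hypothetical; the inputs correspond to the two PromQL terms above.

```python
def seconds_between_messages(total_rate_per_sec: float, active_chargers: int) -> float:
    """Average interval between MeterValues messages for a single charger."""
    if total_rate_per_sec <= 0 or active_chargers <= 0:
        raise ValueError("rate and charger count must be positive")
    per_charger_rate = total_rate_per_sec / active_chargers  # messages/s per charger
    return 1.0 / per_charger_rate


if __name__ == "__main__":
    # 30 chargers producing about 6 MeterValues per second in total.
    print(round(seconds_between_messages(6.0, 30), 2))
```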

Mitigation (follow-up PR, not this PR!):

# Redis-based 30s debounce: placeholder for a follow-up PR
# Assumes `redis` is an already initialised redis.asyncio client in module scope
async def _touch_last_seen_debounced(self, action: str | None = None) -> None:
    key = f"ocpp:last_seen_touch:{self._charger_id}"
    # SET NX EX: set only if the key does not exist, with a 30s TTL
    locked = await redis.set(key, "1", nx=True, ex=30)
    if not locked:
        # An UPDATE already ran within the last 30s; skip
        return
    await self._touch_last_seen(action=action)

Decision support:

  • p95 latency > 100ms sustained (24h+) -> add debounce
  • DB pool utilization > 50% sustained -> add debounce
  • Otherwise the current simple middleware is sufficient (UPDATE on every CALL; a PostgreSQL UPDATE ... WHERE id = ? with the partial index takes 1-5ms in practice).

Trade-off:

  • With debounce there is a single UPDATE per 30s window, so last_seen_at can be stale by up to 30s. The watchdog works with a threshold of 2*heartbeat_interval (default 120s), so 30s of staleness is absorbed.
  • If debounce is added, the rate of ocpp_middleware_touch_total{result="success"} drops. Expect a new baseline; alarm thresholds may need to be re-tuned.
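
To make the trade-off concrete, the sketch below simulates the SET NX EX behaviour with an in-memory stand-in and counts how many UPDATEs survive debouncing. It is an illustration, not Redis: the class, its method, and the example key are hypothetical, and time is passed in explicitly so the result is deterministic.

```python
from typing import Dict, List


class FakeDebounceStore:
    """In-memory stand-in for Redis SET key value NX EX ttl (illustration only)."""

    def __init__(self) -> None:
        self._expires_at: Dict[str, float] = {}

    def set_nx_ex(self, key: str, ttl: float, now: float) -> bool:
        """Set the key only if it is absent or expired; return True when set."""
        expiry = self._expires_at.get(key)
        if expiry is not None and now < expiry:
            return False
        self._expires_at[key] = now + ttl
        return True


def count_updates(message_times: List[float], ttl: float = 30.0) -> int:
    """Number of DB UPDATEs that survive debouncing for one charger."""
    store = FakeDebounceStore()
    key = "ocpp:last_seen_touch:CP-1"  # example key for a single charger
    return sum(1 for t in message_times if store.set_nx_ex(key, ttl, t))


if __name__ == "__main__":
    # One MeterValues message every 5s for two minutes: 24 messages.
    times = [float(t) for t in range(0, 120, 5)]
    print(len(times), "messages ->", count_updates(times), "UPDATEs")
```

In this simulation a charger sending every 5s goes from 24 UPDATEs to 4 over two minutes, while last_seen_at is never more than one TTL (30s) behind, well inside the 120s watchdog threshold.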

5. Post-Profiling Decision (Is Redis Debounce Needed?)

The most critical decision this runbook supports: after the 24h profiling window of the C5 middleware, does Redis debounce need to be added?

Decision Matrix

| Observation (24h window) | Decision |
|---|---|
| p95 latency <50ms sustained | Debounce NOT needed; the current simple middleware is sufficient. |
| p95 latency 50-100ms sustained | Track the trend; re-evaluate as the fleet grows. |
| p95 latency >100ms sustained | Add debounce in a follow-up PR. |
| DB pool util <30% sustained | Debounce NOT needed. |
| DB pool util >50% sustained | Add debounce; it reduces pool pressure. |
| OcppMiddlewareTouchSlow fires and clears (transient) | Increase the pool size; do not add debounce. |
| OcppMiddlewareTouchSlow sustained for 24h+ | Add debounce. |
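
One possible encoding of the matrix as a function is sketched below. Two things are assumptions rather than part of the matrix: any "add debounce" signal takes precedence over the others, and the 30-50% pool utilization band, which the matrix leaves open, is treated as "track trend". The function name and return strings are hypothetical.

```python
def debounce_decision(p95_ms: float, pool_util: float, slow_alert_sustained: bool) -> str:
    """Apply the 24h decision matrix to sustained observations."""
    # Any single "add" signal wins (assumed precedence).
    if slow_alert_sustained or p95_ms > 100 or pool_util > 0.5:
        return "add debounce"
    # Middle bands: 50-100ms latency, or the 30-50% pool band the matrix leaves open.
    if p95_ms >= 50 or pool_util >= 0.3:
        return "track trend"
    return "not needed"


if __name__ == "__main__":
    print(debounce_decision(p95_ms=20, pool_util=0.10, slow_alert_sustained=False))
    print(debounce_decision(p95_ms=70, pool_util=0.10, slow_alert_sustained=False))
    print(debounce_decision(p95_ms=20, pool_util=0.60, slow_alert_sustained=False))
```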

Collecting Profile Data

# p95 latency snapshots collected over 24h
# -G with --data-urlencode keeps curl from globbing the [5m] brackets; jq -r emits bare numbers for sort -n
curl -sG "http://prometheus:9090/api/v1/query_range" \
--data-urlencode 'query=histogram_quantile(0.95, sum by (le) (rate(ocpp_middleware_touch_duration_seconds_bucket[5m])))' \
--data-urlencode "start=$(date -u -d '24 hours ago' +%s)" \
--data-urlencode "end=$(date -u +%s)" \
--data-urlencode "step=300" | \
jq -r '.data.result[0].values[] | .[1]' | sort -n | tail -10

# DB pool util max
curl -sG "http://prometheus:9090/api/v1/query_range" \
--data-urlencode 'query=max(zeus_db_pool_utilization)' \
--data-urlencode "start=$(date -u -d '24 hours ago' +%s)" \
--data-urlencode "end=$(date -u +%s)" \
--data-urlencode "step=300" | \
jq -r '.data.result[0].values[] | .[1]' | sort -n | tail -5
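
If the JSON response is saved to a file, the jq/sort/tail step can also be done in Python. The sketch below relies on the documented query_range response shape (data.result[].values as [timestamp, "value"] pairs); the function name is hypothetical, and NaN samples, which histogram_quantile can return when there is no traffic, are dropped.

```python
import json
import sys
from typing import Any, Dict, List


def top_values(response: Dict[str, Any], n: int = 10) -> List[float]:
    """Return the n largest sample values from the first result series, ascending."""
    results = response.get("data", {}).get("result", [])
    if not results:
        return []
    values = [
        float(raw)
        for _timestamp, raw in results[0]["values"]
        if raw != "NaN"  # no observations in that window
    ]
    return sorted(values)[-n:]


if __name__ == "__main__":
    # Example: curl ... > p95.json && python top_values.py < p95.json
    print(top_values(json.load(sys.stdin)))
```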

After the Decision

If debounce is added:

  • Implement _touch_last_seen_debounced in a follow-up PR.
  • Revise the P2 alert threshold in this runbook (>0.1 -> >0.5); with debounce, p95 should drop dramatically.
  • New metric: ocpp_middleware_touch_debounced_total{result}, a counter of UPDATEs skipped by debounce.

If debounce is not added:

  • Re-profile in 6 months; pressure grows as the fleet grows.
  • If DB pool pressure develops, go to Scenario 1.

6. Regression Protection

After PR-B C8, these invariants must not break:

  1. _touch_last_seen is called on every CALL: guaranteed by the test_charge_point_middleware.py test suite.
  2. Exceptions are swallowed: the parent route_message dispatch is not disrupted.
  3. Rate-limited log: _TOUCH_FAIL_LOG_INTERVAL_SEC = 60.0 prevents log spam of 30+ lines per second.
  4. Metric cardinality: NO labels are added beyond result (2 values) and tenant_id (~10 values).
  5. Severity high (watchdog): OCPP_HEARTBEAT_TIMEOUT alarm severity raised from medium to high (C8 update).

Pre-deploy checklist

  • alembic current reports 0043 or later
  • prometheus/rules/ocpp_charger_liveness.yml is loaded in Prometheus
  • grafana/dashboards/ocpp-charger-liveness.json is visible in Grafana
  • The Celery beat task ocpp_refresh_charger_last_seen_lag runs every 60s
  • The /metrics endpoint exports ocpp_middleware_touch_total, ocpp_middleware_touch_duration_seconds, ocpp_charger_last_seen_lag_seconds
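
The last checklist item lends itself to a scripted check. The sketch below scans a /metrics text exposition for the three metric names; the function name is hypothetical, and it only looks at sample lines, so a labelled metric that has not recorded any sample yet would be reported as missing.

```python
from typing import Iterable, List

REQUIRED_METRICS = (
    "ocpp_middleware_touch_total",
    "ocpp_middleware_touch_duration_seconds",
    "ocpp_charger_last_seen_lag_seconds",
)


def missing_metrics(metrics_text: str, required: Iterable[str] = REQUIRED_METRICS) -> List[str]:
    """Return required metric names absent from a /metrics text exposition."""
    exported = set()
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip HELP/TYPE comments and blank lines
        # The sample name ends at the first '{' or space.
        exported.add(line.split("{", 1)[0].split(" ", 1)[0])
    # Histograms export <name>_bucket, <name>_sum and <name>_count series.
    return [
        name
        for name in required
        if not any(e == name or e.startswith(name + "_") for e in exported)
    ]


if __name__ == "__main__":
    sample = 'ocpp_middleware_touch_total{result="success"} 12\n'
    print(missing_metrics(sample))
```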

Post-deploy 24h canary

  • P1/P2/P3 alarms are NOT firing
  • Panel 1 (Success Rate) >0.99
  • Panel 7 (Lag by Tenant): all tenants <300s
  • The ocpp_last_seen_touch_failed event is rare in the logs (fewer than a few per hour)

7. References

  • C5 commit (97b3afe): ZeusChargePoint16.route_message override + middleware
  • Migration 0043: last_seen_at TIMESTAMPTZ NULL + ix_ocpp_chargers_last_seen_partial
  • C7 audit report: 4 MINOR findings (log spam, autovacuum, defensive suppress, action context)
  • C8 (this PR): Observability (metric + alert + dashboard + runbook + severity high)
  • Dashboard: grafana/dashboards/ocpp-charger-liveness.json
  • Alert file: prometheus/rules/ocpp_charger_liveness.yml
  • Test suite: backend/tests/features/ocpp/test_charge_point_middleware.py, backend/tests/features/ocpp/test_heartbeat_watchdog.py