OCPP Charger Liveness Middleware Runbook
The legacy heartbeat_watchdog Celery task is deprecated. Heartbeat/inactivity
coverage is now provided by the PR-A1 (ocpp_charger_offline, immediate high
severity on WebSocket disconnect) and PR-A2 (ocpp_charger_offline_extended,
critical after 1h+) domain evaluators. References to heartbeat_watchdog in this
runbook are kept as historical context; current operational coverage is tracked
through the PR-A1/A2 domain tasks (ocpp_domain.check_charger_offline_extended).
Details: OCPP Changelog, PR-H/4-fix.
Owner: backend on-call (backup: backend-systems-architect)
Related dashboard: Grafana > OCPP > OCPP Charger Liveness (uid: ocpp-charger-liveness)
Related alert file: prometheus/rules/ocpp_charger_liveness.yml
Backend source:
- Middleware: backend/app/core/ocpp/v16/charge_point.py::_touch_last_seen (commit 97b3afe)
- Watchdog: DEPRECATED in PR-H/4-fix (no-op stub): backend/app/tasks/ocpp_tasks.py::heartbeat_watchdog
- Domain evaluator (new): backend/app/features/alarms/ocpp_evaluators.py::evaluate_charger_offline_extended (PR-A2, 1h+ heartbeat threshold, critical)
- Gauge refresh: backend/app/tasks/ocpp_tasks.py::refresh_charger_last_seen_lag
- Metric definitions: backend/app/core/observability/ocpp_metrics.py
Migration reference: 0043 (last_seen_at TIMESTAMPTZ NULL + ix_ocpp_chargers_last_seen_partial) + 0055 (PR-H/4-fix double-prefix cleanup)
Last updated: 2026-05-21 (PR-H/4-fix watchdog deprecation + field bug cleanup)
0. Why Does This Runbook Exist?
Production motivation: before PR-B, heartbeat_watchdog looked ONLY at last_heartbeat_at. Chargers such as the Beny BCP-2ATN-P, which have a 300 s default Heartbeat interval but send MeterValues every 5-30 s while actively charging, were wrongly marked offline.
Fix (PR-B C5):
- Migration 0043: ocpp_chargers.last_seen_at TIMESTAMPTZ NULL + partial index.
- ZeusChargePoint16.route_message override: on every inbound CALL, UPDATE ocpp_chargers SET last_seen_at = now().
- heartbeat_watchdog decides based on COALESCE(last_seen_at, last_heartbeat_at).
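The COALESCE-based decision above can be illustrated with a minimal Python sketch. It is not the backend's actual code: the function name is_charger_offline and the timeout_seconds parameter are hypothetical, and treating a never-seen charger as offline is an assumption made here.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_charger_offline(
    last_seen_at: Optional[datetime],
    last_heartbeat_at: Optional[datetime],
    now: datetime,
    timeout_seconds: float,
) -> bool:
    """Hypothetical Python mirror of COALESCE(last_seen_at, last_heartbeat_at)."""
    # Prefer last_seen_at (touched on every inbound CALL); fall back to the
    # heartbeat timestamp for rows where last_seen_at is still NULL.
    reference = last_seen_at if last_seen_at is not None else last_heartbeat_at
    if reference is None:
        # Assumption: a charger with no timestamp at all counts as offline.
        return True
    return (now - reference).total_seconds() > timeout_seconds


now = datetime(2026, 5, 21, 12, 0, 0, tzinfo=timezone.utc)
heartbeat = now - timedelta(seconds=290)  # 300 s Heartbeat interval, almost due
seen = now - timedelta(seconds=10)        # MeterValues arrived 10 s ago

print(is_charger_offline(seen, heartbeat, now, timeout_seconds=120))  # False
print(is_charger_offline(None, heartbeat, now, timeout_seconds=120))  # True
```

The second call shows the pre-PR-B behaviour: with only the heartbeat timestamp available, an actively charging unit looks offline.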
What does this runbook cover?
When the middleware UPDATEs fail (DB pool exhausted, network outage, schema drift), last_seen_at is not updated -> the watchdog again wrongly marks chargers offline, so the problem PR-B originally solved resurfaces. PR-B C8 observability closes this gap.
1. Triggering Alerts
Alerts that route to this runbook (referred to below as P1-P4, in table order):
| Alert | Severity | for | Threshold | Meaning |
|---|---|---|---|---|
| OcppMiddlewareTouchErrorRateHigh | critical | 5m | rate > 0.1/s | Middleware UPDATEs are failing systemically. |
| OcppMiddlewareTouchSlow | warning | 10m | p95 > 100ms | DB pool contention / signal to consider Redis debounce. |
| OcppChargerLastSeenLagHigh | warning | 5m | tenant lag > 600 s | Fleet health indicator; the watchdog is about to mark chargers offline. |
| OcppMiddlewareTouchSuccessRateLow | info | 15m | < 95% | Low volume + high error rate; passive monitoring. |
All alerts are tagged with the component=ocpp and slo=ocpp_liveness_middleware labels. Routing: critical -> page, warning -> ticket, info -> Slack thread.
2. Quick Diagnosis (5 minutes)
Step 1: Open the dashboard
Grafana > OCPP folder > OCPP Charger Liveness:
- URL: https://grafana.<host>/d/ocpp-charger-liveness
- Default time range: last 3 hours
Step 2: Check the 4 panels in order
- Panel 1 (Touch Success Rate): >0.99 is normal. <0.95 = systemic problem.
- Panel 2 (Touch Fail Rate): if >0.1/s, P1 is firing.
- Panel 3 (Touch p95 Latency): if >100ms, P2 is firing.
- Panel 4 (Fleet Max Lag): if >600 s, P3 is firing.
Step 3: Cross-check with logs (rate-limited!)
Middleware failure logs are limited to 1 per minute per charger (_TOUCH_FAIL_LOG_INTERVAL_SEC = 60.0). The true failure rate is therefore in the metric; the logs only show a sample.
# Middleware errors in the last 10 min (rate-limited sample)
docker compose logs backend --since 10m 2>&1 | grep ocpp_last_seen_touch_failed | head -20
# Per-charger breakdown (cardinality safe, taken from logs)
# --no-log-prefix drops the "service | " prefix so jq receives plain JSON
docker compose logs backend --no-log-prefix --since 30m 2>&1 \
  | grep ocpp_last_seen_touch_failed \
  | jq -r '.charger_id // .extra.charger_id' \
  | sort | uniq -c | sort -rn | head -10
# Action breakdown: during which CALL type do failures occur?
docker compose logs backend --no-log-prefix --since 30m 2>&1 \
  | grep ocpp_last_seen_touch_failed \
  | jq -r '.action // .extra.action' \
  | sort | uniq -c | sort -rn
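Where jq is not available, the same breakdown can be approximated in Python. This is a sketch under assumptions: it expects each log line to contain one JSON object (optionally preceded by a compose-style "service | " prefix) with charger_id and action either at the top level or under extra, mirroring the jq filters above. The helper name count_by_field is hypothetical.

```python
import json
from collections import Counter
from typing import Iterable

EVENT = "ocpp_last_seen_touch_failed"


def count_by_field(lines: Iterable[str], field: str) -> Counter:
    """Count failure events per value of `field` (e.g. charger_id, action)."""
    counts: Counter = Counter()
    for line in lines:
        if EVENT not in line:
            continue
        start = line.find("{")  # skip any "backend-1  | " style prefix
        if start == -1:
            continue
        try:
            record = json.loads(line[start:])
        except json.JSONDecodeError:
            continue  # not a structured log line
        if not isinstance(record, dict):
            continue
        # Mirror the jq filter: .<field> // .extra.<field>
        value = record.get(field)
        extra = record.get("extra")
        if value is None and isinstance(extra, dict):
            value = extra.get(field)
        if value is not None:
            counts[str(value)] += 1
    return counts


sample = [
    'backend-1  | {"event": "ocpp_last_seen_touch_failed", "charger_id": "CP-1", "action": "MeterValues"}',
    'backend-1  | {"event": "ocpp_last_seen_touch_failed", "extra": {"charger_id": "CP-1", "action": "Heartbeat"}}',
    'backend-1  | {"event": "ocpp_last_seen_touch_failed", "charger_id": "CP-2", "action": "MeterValues"}',
    "backend-1  | ocpp_last_seen_touch_failed (plain text, ignored)",
]
print(count_by_field(sample, "charger_id").most_common())  # [('CP-1', 2), ('CP-2', 1)]
```

In practice the lines would come from standard input, for example by piping the docker compose logs output into a script that calls count_by_field(sys.stdin, "action").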
Step 4: Pattern decision
| Observation | Hypothesis | Action |
|---|---|---|
| P1 firing + DB pool util ~1.0 | Pool exhausted | §Scenario 1 |
| P1 firing + DB connection errors in logs | Postgres down / network | §Scenario 3 |
| P1 firing + last_seen_at column missing | Schema drift (migration 0043 did not run) | §Scenario 2 |
| P2 firing + DB pool util > 50% sustained | Redis debounce decision | §Scenario 5 |
| P3 firing + P1 firing | Middleware FAIL -> lag is growing | Resolve P1 first |
| P3 firing + P1 normal | Charger fleet connectivity | §Scenario 4 |
| P4 firing alone | Low volume + errors | Passive monitoring |
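The table above can also be read as a small decision function. The sketch below is illustrative only: the Observation dataclass, its field names, the 0.95 cut-off used for "pool util ~1.0", and the order in which the P1 sub-cases are checked are all assumptions, since the table does not define a precedence beyond "resolve P1 first".

```python
from dataclasses import dataclass


@dataclass
class Observation:
    """Hypothetical snapshot of what the on-call engineer sees."""
    p1_firing: bool = False
    p2_firing: bool = False
    p3_firing: bool = False
    p4_firing: bool = False
    db_pool_util: float = 0.0              # 0.0-1.0
    db_connection_errors: bool = False     # connection errors in backend logs
    last_seen_column_missing: bool = False


def triage(obs: Observation) -> str:
    """Map an observation to the scenario named in the pattern table."""
    if obs.p1_firing:
        # P1 always wins, including the "P3 + P1" row (resolve P1 first).
        if obs.last_seen_column_missing:
            return "Scenario 2"
        if obs.db_connection_errors:
            return "Scenario 3"
        if obs.db_pool_util >= 0.95:  # assumed reading of "~1.0"
            return "Scenario 1"
        return "P1 firing, no matching pattern"
    if obs.p3_firing:
        return "Scenario 4"
    if obs.p2_firing and obs.db_pool_util > 0.5:
        return "Scenario 5"
    if obs.p4_firing:
        return "Passive monitoring"
    return "No matching pattern"


print(triage(Observation(p1_firing=True, p3_firing=True, db_pool_util=0.99)))  # Scenario 1
print(triage(Observation(p3_firing=True)))                                     # Scenario 4
```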
3. Escalation Matrix
| Level | Who | When | Contact |
|---|---|---|---|
| L1 | Backend on-call | Immediately | Slack #zeus-incidents + PagerDuty |
| L2 | backend-systems-architect | If L1 cannot resolve within 30 min | Slack DM |
| L3 | CTO | If L2 cannot resolve within 1 h, or impact >50 chargers | Phone |
P1 critical (OcppMiddlewareTouchErrorRateHigh):
- If not resolved within 30 min -> CTO escalation.
- If it is migration drift (§Scenario 2): run alembic upgrade head in coordination with devops-deployment-agent.
P2/P3 warning:
- Backend on-call ticket -> triage within 4 hours.
- P3 + P1 together: resolve P1 first; when P3 fires alone, track the trend.
4. Common Scenarios
Scenario 1: DB connection pool exhaustion
Symptoms:
- OcppMiddlewareTouchErrorRateHigh firing
- Grafana Panel 9 (DB Pool Utilization) ~100%
- Logs: QueuePool limit of size N overflow M reached or asyncpg.PoolTimeoutError
Diagnosis:
# DB pool status
curl -s http://localhost:8000/metrics | grep zeus_db_pool_
# Active Postgres connections
docker compose exec db psql -U zeus -c "SELECT count(*), state \
FROM pg_stat_activity WHERE datname='zeus' GROUP BY state;"
Mitigation:
- Short term: restarting the backend resets the pool (about 5 min to recover): docker compose restart backend
- Medium term: increase DATABASE_POOL_SIZE (default 10) and DATABASE_MAX_OVERFLOW (default 20).
- Long term: add a Redis-based 30 s debounce (§Scenario 5).
Prevention: if Panel 9 shows zeus_db_pool_utilization > 0.7 sustained, increase the pool size proactively.
Scenario 2: Migration drift (last_seen_at column missing)
Symptoms:
- OcppMiddlewareTouchErrorRateHigh firing
- Logs: asyncpg.UndefinedColumnError: column "last_seen_at" does not exist
- All chargers are affected (broad blast radius)
Diagnosis:
# Current migration state
docker compose exec backend alembic current
# Expected: 0043_xxx (last_seen_at + partial index)
# If it is 0042 or earlier, the migration has not run
Mitigation:
# Apply the migration (production-safe, additive)
docker compose exec backend alembic upgrade head
# Verify
docker compose exec backend alembic current
docker compose exec db psql -U zeus -c \
"\d ocpp_chargers" | grep last_seen_at
Prevention: the deploy pipeline must include an alembic upgrade head step. Gate CI on alembic check (schema drift).
Scenario 3: PostgreSQL down / network outage
Symptoms:
- OcppMiddlewareTouchErrorRateHigh firing
- Logs: asyncpg.ConnectionDoesNotExistError, OSError: connection refused
- Other DB-dependent metrics also break (zeus_db_pool_* is 0 or absent)
Diagnosis:
# Postgres container health
docker compose ps db
docker compose logs db --tail 50
# Network reachability
docker compose exec backend nc -zv db 5432
Mitigation:
# Restart Postgres
docker compose restart db
# Reconnect the backend (pool refresh)
docker compose restart backend
Prevention: a Postgres health probe (pg_isready) in the docker-compose healthcheck block, plus a restart: unless-stopped policy.
Scenario 4: Charger fleet connectivity problem
Symptoms:
- OcppChargerLastSeenLagHigh firing (P3)
- P1 (OcppMiddlewareTouchErrorRateHigh) NORMAL: the middleware is working
- A single tenant or a single charger group is affected
- The Panel 7 (Lag by Tenant) bar chart singles out one tenant
Diagnosis:
# Chargers in the tenant, ordered by last activity
docker compose exec db psql -U zeus -c \
"SELECT cpid, last_seen_at, last_heartbeat_at, status, \
EXTRACT(EPOCH FROM (now() - last_seen_at))::int AS lag_sec \
FROM ocpp_chargers \
WHERE tenant_id = '<TENANT_ID>' \
ORDER BY last_seen_at NULLS FIRST LIMIT 10;"
# Active WS connections vs total chargers
curl -s http://localhost:8000/metrics | grep -E "ocpp_ws_connections_active|zeus_ocpp_chargers_total"
Mitigation:
- Notify the vendor integration team (per-charger connectivity problem).
- If there is physical access to the charger: check the 4G modem / WiFi signal.
- At the network level: check the nginx /ocpp/ access log -> is there traffic from the charger's IP?
Prevention: a per-charger health check (LWT MQTT); this is outside the OCPP spec and vendor-specific.
Scenario 5: High MeterValues frequency -> Redis debounce decision
Symptoms:
- OcppMiddlewareTouchSlow firing (P2)
- Panel 6 (Latency Percentiles) p95 > 100ms sustained for 24h+
- Panel 9 (DB Pool) 50%+ utilization sustained
- The charger fleet is mostly Beny or other vendors with a high MeterValues frequency
Diagnosis:
# MeterValues message rate / active chargers
curl -s http://localhost:8000/metrics | grep -E \
"ocpp_messages_total{action=\"MeterValues\"|ocpp_ws_connections_active"
Calculation: sum(rate(ocpp_messages_total{action="MeterValues"}[5m])) / sum(ocpp_ws_connections_active). With 30 chargers, a fleet rate of ~6/s is 0.2 messages/s per charger, meaning each charger sends MeterValues once every 5 s.
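The arithmetic above, and the effect a 30 s per-charger debounce would have on the UPDATE rate, can be checked with a few lines of Python. The function names are hypothetical; the debounce estimate assumes at most one UPDATE per charger per window, as the per-charger key in the placeholder code for this scenario implies.

```python
def per_charger_interval_seconds(fleet_rate_per_sec: float, active_chargers: int) -> float:
    """Average seconds between MeterValues messages from a single charger."""
    if fleet_rate_per_sec <= 0 or active_chargers <= 0:
        raise ValueError("rate and charger count must be positive")
    return active_chargers / fleet_rate_per_sec


def updates_per_second_with_debounce(
    fleet_rate_per_sec: float, active_chargers: int, window_sec: float
) -> float:
    """Upper bound on DB UPDATEs/s when each charger updates at most once per window."""
    if window_sec <= 0:
        raise ValueError("window must be positive")
    return min(fleet_rate_per_sec, active_chargers / window_sec)


print(per_charger_interval_seconds(6.0, 30))            # 5.0 seconds between messages
print(updates_per_second_with_debounce(6.0, 30, 30.0))  # 1.0 UPDATE/s instead of 6.0
```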
Mitigation (next PR, not in this PR!):
# Redis-based 30 s debounce: placeholder for the next PR.
# Assumes `redis` is an already-configured redis.asyncio.Redis client (not defined here).
async def _touch_last_seen_debounced(self, action: str | None = None) -> None:
    key = f"ocpp:last_seen_touch:{self._charger_id}"
    # SET NX EX: set only if the key does not exist, with a 30 s TTL
    locked = await redis.set(key, "1", nx=True, ex=30)
    if not locked:
        # An UPDATE already happened within the last 30 s; skip
        return
    await self._touch_last_seen(action=action)
Decision support:
- p95 latency > 50ms sustained (24h+) -> add debounce
- DB pool utilization > 50% sustained -> add debounce
- Otherwise the current simple middleware is sufficient (UPDATE on every CALL; with the partial index, a PostgreSQL UPDATE ... WHERE id = ? takes 1-5ms in practice).
Trade-off:
- Debounce means a single UPDATE per 30 s window -> last_seen_at can be up to 30 s stale. The watchdog works with a 2*heartbeat_interval threshold (default 120 s) -> 30 s of staleness is absorbed.
- If debounce is added: the ocpp_middleware_touch_total{result="success"} rate drops; expect a new baseline, and alert thresholds may need re-tuning.
5. Post-Profile Decision (Is Redis Debounce Needed?)
This runbook's most critical decision support: after the C5 middleware's for: 24h profiling window, does Redis debounce need to be added?
Decision Matrix
| Observation (24h window) | Decision |
|---|---|
| p95 latency <50ms sustained | Debounce NOT NEEDED; the current simple middleware is sufficient. |
| p95 latency 50-100ms sustained | Track the trend; re-evaluate as the fleet grows. |
| p95 latency >100ms sustained | Add debounce in the next PR. |
| DB pool util <30% sustained | Debounce NOT NEEDED |
| DB pool util >50% sustained | Add debounce; it reduces pool pressure. |
| OcppMiddlewareTouchSlow fires and clears (transient) | Increase the pool size; do not add debounce. |
| OcppMiddlewareTouchSlow sustained 24h+ | Add debounce. |
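One way to apply the matrix consistently is to encode it as a function. The sketch below is an interpretation, not an agreed policy: the function name, the touch_slow values, the precedence between rows, and the handling of the 30-50% pool utilization band (which the matrix leaves unspecified and which is treated here as "track trend") are assumptions.

```python
def debounce_decision(p95_latency_ms: float, db_pool_util: float, touch_slow: str = "none") -> str:
    """Return a decision for a 24h profiling window.

    touch_slow is one of "none", "transient", "sustained" and describes how
    the OcppMiddlewareTouchSlow alert behaved during the window.
    """
    if touch_slow not in ("none", "transient", "sustained"):
        raise ValueError(f"unknown touch_slow value: {touch_slow!r}")
    # Any "add debounce" row wins over the others (assumed precedence).
    if touch_slow == "sustained" or p95_latency_ms > 100 or db_pool_util > 0.5:
        return "add debounce"
    if touch_slow == "transient":
        return "increase pool size, no debounce"
    # Assumption: the unspecified 30-50% pool band is handled like 50-100ms latency.
    if p95_latency_ms >= 50 or db_pool_util >= 0.3:
        return "track trend"
    return "debounce not needed"


print(debounce_decision(p95_latency_ms=12, db_pool_util=0.2))   # debounce not needed
print(debounce_decision(p95_latency_ms=70, db_pool_util=0.2))   # track trend
print(debounce_decision(p95_latency_ms=130, db_pool_util=0.2))  # add debounce
```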
Collecting Profile Data
# p95 latency snapshots collected over 24h (10 highest values)
curl -s "http://prometheus:9090/api/v1/query_range?\
query=histogram_quantile(0.95,sum%20by%20(le)(rate(ocpp_middleware_touch_duration_seconds_bucket[5m])))&\
start=$(date -u -d '24 hours ago' +%s)&end=$(date -u +%s)&step=300" | \
jq '.data.result[0].values[] | .[1]' | sort -n | tail -10
# DB pool util maximum (5 highest values)
curl -s "http://prometheus:9090/api/v1/query_range?\
query=max(zeus_db_pool_utilization)&\
start=$(date -u -d '24 hours ago' +%s)&end=$(date -u +%s)&step=300" | \
jq '.data.result[0].values[] | .[1]' | sort -n | tail -5
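If the query_range output needs to be summarized rather than eyeballed, the JSON can be post-processed in Python. The sketch below works on an already-fetched response and uses a made-up sample payload in the standard Prometheus matrix shape; the function name summarize_first_series and the sample values are illustrative, not real measurements.

```python
import math
from typing import Optional


def summarize_first_series(response: dict, threshold: float) -> Optional[dict]:
    """Summarize the first series of a Prometheus query_range (matrix) response."""
    results = response.get("data", {}).get("result", [])
    if not results:
        return None
    # Each sample is [unix_timestamp, "value as string"]; the value may be "NaN".
    samples = [float(raw) for _ts, raw in results[0].get("values", [])]
    samples = [s for s in samples if not math.isnan(s)]
    if not samples:
        return None
    over = sum(1 for s in samples if s > threshold)
    return {
        "max": max(samples),
        "samples": len(samples),
        "share_over_threshold": over / len(samples),
    }


# Illustrative payload only: p95 latency in seconds at 5-minute steps.
sample_response = {
    "status": "success",
    "data": {
        "resultType": "matrix",
        "result": [{
            "metric": {},
            "values": [
                [1747821600, "0.004"],
                [1747821900, "0.012"],
                [1747822200, "NaN"],
                [1747822500, "0.150"],
            ],
        }],
    },
}
print(summarize_first_series(sample_response, threshold=0.1))
```

The share of samples above the threshold gives a rough sense of whether a breach is sustained or transient, which is the distinction the decision matrix relies on.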
After the Decision
If debounce is added:
- Implement _touch_last_seen_debounced in the next PR.
- Revise the P2 alert threshold in this runbook (>0.1 -> >0.5); with debounce, p95 should drop dramatically.
- New metric: ocpp_middleware_touch_debounced_total{result}, a counter of UPDATEs skipped by debounce.
If debounce is not added:
- Re-profile after 6 months; pressure grows as the fleet grows.
- If DB pool pressure appears, go to Scenario 1.
6. Regression Protection
After PR-B C8, these invariants must not break:
- _touch_last_seen is called on every CALL: guaranteed by the test_charge_point_middleware.py test suite.
- Exceptions are swallowed: the route_message parent dispatch is not disrupted.
- Rate-limited log: _TOUCH_FAIL_LOG_INTERVAL_SEC = 60.0 prevents log spam of 30+ lines per second.
- Metric cardinality: NO labels are added other than result (2 values) + tenant_id (~10 values).
- Severity high (watchdog): OCPP_HEARTBEAT_TIMEOUT alarm severity medium -> high (C8 update).
Pre-deploy checklist
- alembic current is 0043 or later
- prometheus/rules/ocpp_charger_liveness.yml is loaded in Prometheus
- grafana/dashboards/ocpp-charger-liveness.json is visible in Grafana
- Celery beat task ocpp_refresh_charger_last_seen_lag runs every 60 s
- The /metrics endpoint exports ocpp_middleware_touch_total, ocpp_middleware_touch_duration_seconds, ocpp_charger_last_seen_lag_seconds
Post-deploy 24h canary
- P1/P2/P3 alarms are NOT in the firing state
- Panel 1 (Success Rate) >0.99
- Panel 7 (Lag by Tenant): all tenants <300 s
- The ocpp_last_seen_touch_failed event is rare in the logs (fewer than a few per hour)
7. References
- C5 commit (97b3afe): ZeusChargePoint16.route_message override + middleware
- Migration 0043: last_seen_at TIMESTAMPTZ NULL + ix_ocpp_chargers_last_seen_partial
- C7 audit report: 4 MINOR findings (log spam, autovacuum, defensive suppress, action context)
- C8 (this PR): observability (metric + alert + dashboard + runbook + severity high)
- Dashboard: grafana/dashboards/ocpp-charger-liveness.json
- Alert file: prometheus/rules/ocpp_charger_liveness.yml
- Test suite: backend/tests/features/ocpp/test_charge_point_middleware.py, backend/tests/features/ocpp/test_heartbeat_watchdog.py