OCPP MeterValues Persist/Broadcast Failure Runbook
Owner: backend on-call (backup: backend-systems-architect)
Related dashboard: Grafana > OCPP > OCPP MeterValues Reliability (uid: ocpp-meter-values)
Related alert file: prometheus/rules/ocpp_meter_values.yml
Backend source: backend/app/core/ocpp/v16/handlers/core.py::on_meter_values (commit 6b3a022)
Migration reference: b8c474b, telemetry_charger_meter PK 4-tuple
Last updated: 2026-05-11 (PR-B observability gap closure)
0. Why Does This Runbook Exist?
Production incident (2026-05-11): the ocpp_meter_values_failed event appeared in the backend log every second for 11+ minutes, yet NO alert fired. The frontend LiveMeterChart UI stayed dead (persist and broadcast shared a single try block, so both failed). Field response began only after a user complaint.
Root cause:
- Before the handler refactor: inside on_meter_values, persist and broadcast lived in a SINGLE try block. When persist blew up, the outer try swallowed the exception, so broadcast was never even attempted.
- Before migration 0042: the PK was the 3-tuple (time, charger_id, measurand). When 3-phase Beny chargers sent the same measurand with different phases, the insert raised IntegrityError, the outer try swallowed it, and the failure stayed silent.
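The PK conflict described above can be reproduced in isolation. The following is a minimal sketch, not the production schema: it uses an in-memory SQLite table as a stand-in for Postgres, and the value column, the charger id CP-1, and the readings are illustrative assumptions. Only the table name and the two PK shapes come from this runbook.

```python
import sqlite3

# Three phase readings of one measurand at the same timestamp (illustrative values).
ROWS = [
    ("2026-05-11T10:00:00Z", "CP-1", "Voltage", "L1", 229.8),
    ("2026-05-11T10:00:00Z", "CP-1", "Voltage", "L2", 230.4),
    ("2026-05-11T10:00:00Z", "CP-1", "Voltage", "L3", 231.1),
]


def insert_with_pk(pk_columns):
    """Create a table with the given PK, insert ROWS, and report the outcome.

    Returns the stored row count, or the string "IntegrityError" on a PK conflict.
    """
    pk = ", ".join(f'"{col}"' for col in pk_columns)
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE telemetry_charger_meter ("
        '"time" TEXT, charger_id TEXT, measurand TEXT, phase TEXT, value REAL, '
        f"PRIMARY KEY ({pk}))"
    )
    try:
        conn.executemany(
            "INSERT INTO telemetry_charger_meter VALUES (?, ?, ?, ?, ?)", ROWS
        )
        return conn.execute(
            "SELECT COUNT(*) FROM telemetry_charger_meter"
        ).fetchone()[0]
    except sqlite3.IntegrityError:
        return "IntegrityError"
    finally:
        conn.close()


old_pk = insert_with_pk(["time", "charger_id", "measurand"])
new_pk = insert_with_pk(["time", "charger_id", "measurand", "phase"])
print(old_pk, new_pk)
```

With the 3-tuple key the second phase row collides with the first; with phase added to the key all three rows are stored.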
Fixed in this PR-B:
- Migration 0042 (b8c474b): PK -> (time, charger_id, measurand, phase). 3-phase payloads no longer conflict.
- Handler refactor (6b3a022): persist and broadcast run in SEPARATE try/except blocks. If persist fails, broadcast is still attempted. New metric semantics: separate MeterValues_persist / MeterValues_broadcast counters plus a status=partial value.
- Alert rules (this PR, prometheus/rules/ocpp_meter_values.yml): 4 alert levels, P1-P4.
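The separate try/except structure can be sketched roughly as follows. This is an illustration of the pattern, not the code in core.py: the function name handle_meter_values, the ConflictError class, and the Counter used in place of real Prometheus counters are all hypothetical, and the real handler is presumably async. Treating "persist OK, broadcast failed" as status=error is also an assumption; the runbook only defines partial as "persist failed, broadcast OK".

```python
import logging
from collections import Counter

log = logging.getLogger("ocpp")
metrics = Counter()  # hypothetical stand-in for the real metric counters


class ConflictError(Exception):
    """Hypothetical stand-in for the DB driver's IntegrityError."""


def handle_meter_values(payload, persist, broadcast):
    """Run persist and broadcast independently and return the overall status."""
    persist_status = "success"
    try:
        persist(payload)
    except ConflictError:
        persist_status = "conflict"
        log.warning("ocpp_meter_values_persist_conflict")
    except Exception:
        persist_status = "error"
        log.exception("ocpp_meter_values_persist_failed")
    metrics[("MeterValues_persist", persist_status)] += 1

    # Broadcast is attempted no matter what happened to persist.
    broadcast_status = "success"
    try:
        broadcast(payload)
    except Exception:
        broadcast_status = "error"
        log.exception("ocpp_event_meter_publish_failed")
    metrics[("MeterValues_broadcast", broadcast_status)] += 1

    if persist_status == "success" and broadcast_status == "success":
        status = "success"
    elif broadcast_status == "success":
        status = "partial"  # persist failed, live UI still received the tick
    else:
        status = "error"  # assumption: any broadcast failure counts as error
    metrics[("MeterValues", status)] += 1
    return status
```

The key design point is that each side effect owns its exception handler, so a database failure can no longer hide the broadcast path, and each path gets its own counter to alert on.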
1. Triggering Alerts
Alerts that route to this runbook:
| Alert | Severity | for | Threshold | Meaning |
|---|---|---|---|---|
| OcppMeterValuesPersistErrorRateHigh | critical | 5m | rate > 0 | Generic Exception during persist. Production-breaking. |
| OcppMeterValuesPartialPersist | warning | 10m | rate > 0.1/s | Persist FAILED but broadcast OK. Historical data loss. |
| OcppMeterValuesBroadcastError | warning | 5m | rate > 0 | publish_meter_tick crashed. UI is affected. |
| OcppMeterValuesPersistConflictSpike | info | 15m | rate > 1/s | IntegrityError (PK conflict). Vendor anomaly. |
All alerts are tagged with the label component=ocpp. Routing: critical -> page, warning -> ticket, info -> Grafana annotation + Slack thread.
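To reason about when an alert with a given threshold and "for" duration would actually fire, the following sketch approximates the Prometheus pending-to-firing behavior over a list of samples. It is a simplified model, not the Prometheus implementation: the function name first_firing_time is hypothetical, samples are assumed to arrive in time order as (timestamp_seconds, rate) pairs, and details such as staleness handling are ignored.

```python
def first_firing_time(samples, threshold, for_seconds):
    """Return the timestamp at which the alert would first fire, or None.

    The condition must hold at every sample from the start of the current
    streak until at least for_seconds have elapsed.
    """
    active_since = None
    for ts, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = ts  # alert becomes pending
            if ts - active_since >= for_seconds:
                return ts  # pending long enough: firing
        else:
            active_since = None  # condition cleared, streak resets
    return None


# One sample per minute with a constant error rate of 1/s, evaluated
# against "rate > 0 for 5m" as in OcppMeterValuesPersistErrorRateHigh.
steady = [(t, 1.0) for t in range(0, 601, 60)]
print(first_firing_time(steady, threshold=0, for_seconds=300))
```

A single sample at or below the threshold resets the streak, which is why a flapping error rate can delay or suppress an alert with a long "for" window.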
2. Quick Diagnosis (5 minutes)
Step 1: Open the dashboard
Grafana > OCPP folder > OCPP MeterValues Reliability:
- URL: https://grafana.<host>/d/ocpp-meter-values
- Default time range: last 3 hours
Step 2: Check the 5 panels in order
- Panel 1 (Persist Success Rate): >0.99 is normal. <0.9 means a systemic problem.
- Panel 3 (Persist Generic Error): >0 means the P1 alert condition is active.
- Panel 4 (Broadcast Error): >0 means the P3 alert condition is active.
- Panel 5 (Status Breakdown, stacked): if the partial or error color shows up, at what rate?
- Panel 6 (Persist Status Rate): is conflict or error dominant?
Step 3: Cross-check with logs
# Persist errors in the last 10 minutes
docker compose logs backend --since 10m 2>&1 | grep -E "ocpp_meter_values_(persist_failed|persist_conflict|failed)" | head -20
# Per-charger breakdown (cardinality-safe: done on logs, not on metric labels)
# --no-log-prefix drops the "backend-1  | " prefix so jq receives plain JSON
docker compose logs backend --no-log-prefix --since 30m 2>&1 \
| grep ocpp_meter_values_persist_failed \
| jq -r '.charger_id // .extra.charger_id' \
| sort | uniq -c | sort -rn | head -10
# Broadcast errors
docker compose logs backend --since 10m 2>&1 | grep ocpp_event_meter_publish_failed
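When jq is not available on the host, the per-charger breakdown can be approximated in Python. This is a sketch under assumptions: it presumes each log line carries one JSON object with charger_id either at the top level or under extra (mirroring the jq filter above), and the function name failures_per_charger is hypothetical.

```python
import json
from collections import Counter


def failures_per_charger(lines, event="ocpp_meter_values_persist_failed"):
    """Count log lines mentioning the event, grouped by charger_id (top 10)."""
    counts = Counter()
    for line in lines:
        if event not in line:
            continue
        start = line.find("{")  # skip any "backend-1  | " style prefix
        if start == -1:
            continue
        try:
            record = json.loads(line[start:])
        except json.JSONDecodeError:
            continue  # not a structured log line
        extra = record.get("extra")
        charger = record.get("charger_id")
        if charger is None and isinstance(extra, dict):
            charger = extra.get("charger_id")
        counts[charger or "unknown"] += 1
    return counts.most_common(10)


sample = [
    'backend-1  | {"event": "ocpp_meter_values_persist_failed", "charger_id": "CP-1"}',
    '{"event": "ocpp_meter_values_persist_failed", "extra": {"charger_id": "CP-2"}}',
    '{"event": "ocpp_meter_values_persist_failed", "charger_id": "CP-1"}',
]
print(failures_per_charger(sample))
```

Passing sys.stdin as the lines argument lets the function consume the output of docker compose logs directly from a pipe.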
Step 4: Pattern decision
| Observation | Hypothesis | Action |
|---|---|---|
| Persist error >0, all chargers | DB connectivity or schema-model drift | §Scenario 1 / §Scenario 2 |
| Persist error >0, single charger | Charger-specific payload anomaly | Logs deep dive |
| Persist conflict >0, single charger | Vendor replay / NTP | §Scenario 4 |
| Persist conflict >0, all chargers | Migration 0042 rolled back or only partially applied | §Scenario 2 |
| Broadcast error >0 | Redis pubsub or WS Hub | §Scenario 3 |
| Status=partial spike + error 0 | Persist layer failed but the outer try swallowed it | §Scenario 1 (persist deep dive) |
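The decision table can also be expressed as a small lookup, which is handy for a chat-ops helper or a triage script. This is only a sketch: the signal and scope keys (persist_error, single, all, and so on) and the function name triage are hypothetical labels chosen for illustration; the hypotheses and actions are copied from the table.

```python
# (signal, scope) -> (hypothesis, action); scope None means "any scope".
TRIAGE = {
    ("persist_error", "all"): ("DB connectivity or schema-model drift", "Scenario 1 / Scenario 2"),
    ("persist_error", "single"): ("Charger-specific payload anomaly", "Logs deep dive"),
    ("persist_conflict", "single"): ("Vendor replay / NTP", "Scenario 4"),
    ("persist_conflict", "all"): ("Migration 0042 rolled back or only partially applied", "Scenario 2"),
    ("broadcast_error", None): ("Redis pubsub or WS Hub", "Scenario 3"),
    ("partial_spike", None): ("Persist layer failed but the outer try swallowed it", "Scenario 1 (persist deep dive)"),
}


def triage(signal, scope=None):
    """Return (hypothesis, action) for an observation, or None if unmapped."""
    # Try the scope-specific row first, then fall back to the scope-agnostic row.
    return TRIAGE.get((signal, scope)) or TRIAGE.get((signal, None))


print(triage("persist_conflict", "all"))
print(triage("broadcast_error", "single"))
```

An unmapped combination returns None, which signals that the observation falls outside this runbook and should be escalated.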
3. Escalation Matrix
| Level | Who | When | Contact |
|---|---|---|---|
| L1 | Backend on-call | Immediately | Slack #zeus-incidents + PagerDuty |
| L2 | backend-systems-architect | If L1 cannot resolve within 30 min | Slack DM |
| L3 | CTO | If L2 cannot resolve within 1 h, or impact >100 chargers | Phone |
| P4 (info) | Vendor integration team | Conflict spike + single charger | Email ticket |
P1 critical (OcppMeterValuesPersistErrorRateHigh):
- Not resolved within 30 min -> CTO escalation.
- If it is schema-model drift (Scenario 1): plan alembic upgrade head in coordination with devops-deployment-agent.
P2/P3 warning:
- Backend on-call ticket -> triage within 4 hours.
4. Common Scenarios
Scenario 1: Migration did not run, schema-model drift
Symptoms:
- OcppMeterValuesPersistErrorRateHigh firing
- Logs: asyncpg.NotNullViolationError: null value in column "phase" or a similar schema mismatch
- In the Persist Status Rate panel, error dominates and conflict is near 0
Diagnosis:
# Current Alembic revision
docker compose exec backend alembic current
# Expected: at least 0042 (the telemetry_charger_meter PK fix migration)
# Migration history
docker compose exec backend alembic history | head -10
# Manual check: telemetry_charger_meter PK
docker compose exec postgres psql -U zeus -d zeus -c "\d telemetry_charger_meter"
Resolution:
- Migration did not run: docker compose exec backend alembic upgrade head
- Migration is applied but the model has drifted: code review, comparing app/features/charger/models/telemetry.py against the DB schema
- Deploy pipeline failed: hand off to devops-deployment-agent (.github/workflows/deploy.yml, migration step around line 553)
Prevention: a migration smoke test in CI (alembic upgrade head against a clean-room DB) is mandatory. Do not merge a PR that lacks it.
Scenario 2: Vendor 3-phase payload, old PK
Symptoms:
- OcppMeterValuesPersistConflictSpike (P4) firing
- Logs: asyncpg.UniqueViolationError on (time, charger_id, measurand) (points to the old 3-tuple PK)
- Only on 3-phase chargers
Diagnosis:
# Has migration 0042 been applied?
docker compose exec postgres psql -U zeus -d zeus -c "
SELECT constraint_name, column_name
FROM information_schema.key_column_usage
WHERE table_name = 'telemetry_charger_meter'
ORDER BY ordinal_position;
"
# Expected: time, charger_id, measurand, phase (4 columns)
# Old: time, charger_id, measurand (3 columns) -> migration missing
# Conflict count per charger (grouped by charger_id)
# --no-log-prefix drops the "backend-1  | " prefix so jq receives plain JSON
docker compose logs backend --no-log-prefix --since 1h 2>&1 \
| grep ocpp_meter_values_persist_conflict \
| jq -r '.charger_id' | sort | uniq -c
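The PK check can be automated by parsing the query output. The sketch below assumes the query is run with psql -At (unaligned, tuples-only output, one constraint_name|column_name pair per line) and that the primary key constraint uses the Postgres default name telemetry_charger_meter_pkey; both are assumptions to verify against the real database. The function names pk_columns and classify_pk are hypothetical.

```python
EXPECTED_PK = ("time", "charger_id", "measurand", "phase")
LEGACY_PK = ("time", "charger_id", "measurand")


def pk_columns(psql_output, constraint="telemetry_charger_meter_pkey"):
    """Extract the ordered PK column names from 'constraint|column' lines."""
    columns = []
    for line in psql_output.splitlines():
        parts = line.strip().split("|")
        if len(parts) == 2 and parts[0] == constraint:
            columns.append(parts[1])
    return tuple(columns)


def classify_pk(columns):
    """Classify the PK shape: 'ok', 'migration_missing', or 'unexpected'."""
    if columns == EXPECTED_PK:
        return "ok"
    if columns == LEGACY_PK:
        return "migration_missing"
    return "unexpected"


output = """telemetry_charger_meter_pkey|time
telemetry_charger_meter_pkey|charger_id
telemetry_charger_meter_pkey|measurand
"""
print(classify_pk(pk_columns(output)))
```

Filtering on the constraint name matters because key_column_usage also lists foreign key and unique constraints, which would otherwise inflate the column count.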
Resolution:
- Migration 0042 not applied: docker compose exec backend alembic upgrade head
- Migration applied but the PK is still a 3-tuple: hand off to migration-integrity-sentinel, since this may be migration corruption
- Observe for 30 min after the fix -> conflict rate should drop to 0
Scenario 3: Broadcast Redis dependency problem
Symptoms:
- OcppMeterValuesBroadcastError (P3) firing
- Logs: ocpp_event_meter_publish_failed + RedisConnectionError / ConnectionResetError
- Frontend LiveMeterChart UI shows no live data (user complaint)
- Persist side is healthy (in panel 5, success dominates and partial/error are minimal)
Diagnosis:
# Redis health
docker compose exec redis redis-cli PING
# Expected: PONG
# Number of Redis pubsub channels
docker compose exec redis redis-cli PUBSUB CHANNELS "ocpp:*" | wc -l
# Number of Redis connections
docker compose exec redis redis-cli INFO clients
# WS Hub backend log
docker compose logs backend --since 15m 2>&1 | grep -E "ws_hub|websocket" | tail -50
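To turn the INFO clients output into a quick connection-pressure number, something like the following sketch could be used. It parses the key:value lines that redis-cli prints and divides connected_clients by a client limit. The default of 10000 matches the stock Redis maxclients setting, but the limit actually configured for this deployment is an assumption to check; the function names are hypothetical.

```python
def parse_info(text):
    """Parse 'key:value' lines from redis-cli INFO output into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue  # skip blank lines and section headers like "# Clients"
        key, _, value = line.partition(":")
        fields[key] = value
    return fields


def client_usage(text, max_clients=10000):
    """Return connected_clients as a fraction of max_clients."""
    info = parse_info(text)
    return int(info["connected_clients"]) / max_clients


sample = "# Clients\r\nconnected_clients:2500\r\nblocked_clients:3\r\n"
print(client_usage(sample))
```

A ratio approaching 1.0 points toward connection exhaustion on the Redis side rather than a crashed WS Hub.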
Resolution:
- Redis down / unreachable: docker compose restart redis (DevOps handoff)
- Connection pool exhaustion: check the backend connection limit (backend/app/core/redis/client.py)
- WS Hub crash: consider a backend restart (docker compose restart backend); the live UI is affected, but the persist side is not
- Observe for 5 min after the fix -> broadcast error rate should drop to 0
Scenario 4: PR-B incident pattern (REFERENCE)
Symptoms (2026-05-11 production):
- The ocpp_meter_values_failed log fired every second (11+ minutes)
- NO alert fired (the old metric set had no MeterValues_persist/broadcast, only MeterValues success/error)
- Frontend LiveMeterChart UI was dead (broadcast was affected too because it shared the same try block)
- Field response came only after a user complaint (~12 minutes later)
Root cause:
- Before migration 0042: the PK was the 3-tuple (time, charger_id, measurand)
- A Beny 3-phase charger sent the same measurand with 3 different phases in its MeterValues payload
- IntegrityError -> the outer try's except Exception swallowed it -> _inc_metric("MeterValues", "in", "error") was incremented
- BUT at that time there was no alert on MeterValues{status="error"} (ocpp.yml only had auth/timeout/heartbeat alerts)
Fix (applied):
- Migration 0042 (b8c474b): PK 4-tuple, which eliminates the conflict
- Handler refactor (6b3a022): separate try/except for persist and broadcast, new metric semantics
- Alert rules (this PR): 4-level alerting, so this pattern cannot stay silent
Prevention against recurrence:
- Every PR that adds a new metric must ship with alert rules + dashboard (Code Audit Sentinel should check cardinality + alert coverage)
- Production canary: OcppMeterValuesPersistErrorRateHigh must stay at 0 for 24-48h after PR-B
5. Postmortem Checklist
If a P1 critical alert stayed firing for >30 min:
- Incident timeline (alert firing, first response, fix, recovery)
- Impact: how many chargers, how much data loss (telemetry_charger_meter row count delta)
- Root cause write-up (5 whys)
- Action items:
  - Is a new alert needed? (if the incident fell outside this runbook's scope)
  - Was a migration / handler code change required?
  - Did it create a cardinality / log volume problem?
- Postmortem document: docs/incidents/<YYYY-MM-DD>-meter-values-incident.md
- CHANGELOG entry (api-documentation-architect handoff)
6. Related Documents
- Handler refactor: backend/app/core/ocpp/v16/handlers/core.py::on_meter_values (commit 6b3a022)
- Migration: backend/alembic/versions/0042_*.py (commit b8c474b)
- Tests: backend/tests/features/ocpp/test_meter_values_*.py (commit ed2440e)
- Metric definitions: backend/app/core/observability/ocpp_metrics.py
- Alert rules: prometheus/rules/ocpp_meter_values.yml
- Dashboard: grafana/dashboards/ocpp-meter-values.json
- Structured logging standard: docs/observability/structured-logging.md (especially the ocpp_meter_values_* events)
- Other OCPP runbooks: PR-E6 Dispatcher Canary