Ana içeriğe geç

OCPP MeterValues Persist/Broadcast Failure Runbook

Sahip: backend on-call (yedek: backend-systems-architect) Ilgili dashboard: Grafana > OCPP > OCPP MeterValues Reliability (uid: ocpp-meter-values) Ilgili alert dosyasi: prometheus/rules/ocpp_meter_values.yml Backend kaynak: backend/app/core/ocpp/v16/handlers/core.py::on_meter_values (commit 6b3a022) Migration referansi: b8c474b — telemetry_charger_meter PK 4-tuple Son guncelleme: 2026-05-11 (PR-B observability gap kapanisi)


0. Bu Runbook Neden Var?

Production incident (2026-05-11): Backend log'unda ocpp_meter_values_failed event'i 11+ dakika boyunca her saniye dustu fakat HICBIR alarm tetiklenmedi. Frontend LiveMeterChart UI olu kaldi (persist+broadcast ayni try'da idi, ikisi de fail). Saha mudahalesi kullanici sikayetinden sonra basladi.

Root cause:

  • Handler refactor oncesi: on_meter_values icinde persist + broadcast TEK bir try'da idi. Persist patlayinca outer try yutuyordu -> broadcast bile denemiyordu.
  • Migration 0042 oncesi: PK (time, charger_id, measurand) 3-tuple idi. 3-fazlı Beny charger'lar ayni measurand'i farkli phase'lerle gonderdiginde IntegrityError -> outer try yutar -> sessiz fail.

Bu PR-B'de cozulen:

  • Migration 0042 (b8c474b): PK -> (time, charger_id, measurand, phase). 3-fazlı payload'lar conflict yaratmaz.
  • Handler refactor (6b3a022): persist + broadcast AYRI try/except bloklarinda. Persist fail -> broadcast yine denenir. Yeni metric semantigi: MeterValues_persist/MeterValues_broadcast ayri counter'lari + status=partial semantigi.
  • Alert kurallari (bu PR, prometheus/rules/ocpp_meter_values.yml): 4 alert seviyesi P1-P4.

1. Tetikleyici Alert'ler

Bu runbook'a yonlendiren alert'ler:

AlertSeverityforEsikAnlami
OcppMeterValuesPersistErrorRateHighcritical5mrate > 0Persist generic Exception. Production-breaking.
OcppMeterValuesPartialPersistwarning10mrate > 0.1/snPersist FAIL ama broadcast OK. Tarihsel veri kaybi.
OcppMeterValuesBroadcastErrorwarning5mrate > 0publish_meter_tick crash. UI etkilenir.
OcppMeterValuesPersistConflictSpikeinfo15mrate > 1/snIntegrityError (PK conflict). Vendor anomaly.

Tum alert'ler component=ocpp label'i ile tag'lenir. Routing: critical -> page, warning -> ticket, info -> Grafana annotation + Slack thread.


2. Hizli Tani (5 dakika)

Adim 1: Dashboard'i ac

Grafana > OCPP klasoru > OCPP MeterValues Reliability:

  • URL: https://grafana.<host>/d/ocpp-meter-values
  • Default time range: son 3 saat

Adim 2: 5 panel'i sirayla kontrol et

  1. Panel 1 (Persist Success Rate)>0.99 normal. <0.9 = sistemik sorun.
  2. Panel 3 (Persist Generic Error)>0 ise P1 alert tetigi aktif.
  3. Panel 4 (Broadcast Error)>0 ise P3 alert tetigi aktif.
  4. Panel 5 (Status Breakdown stacked) — partial veya error rengi gorunuyorsa hangi rate?
  5. Panel 6 (Persist Status Rate) — conflict mi error mi baskin?

Adim 3: Log ile cross-check

# Son 10dk persist hatalari
docker compose logs backend --since 10m 2>&1 | grep -E "ocpp_meter_values_(persist_failed|persist_conflict|failed)" | head -20

# Charger basina breakdown (cardinality safe — log uzerinden)
docker compose logs backend --since 30m 2>&1 \
| grep ocpp_meter_values_persist_failed \
| jq -r '.charger_id // .extra.charger_id' \
| sort | uniq -c | sort -rn | head -10

# Broadcast hatalari
docker compose logs backend --since 10m 2>&1 | grep ocpp_event_meter_publish_failed

Adim 4: Pattern karari

GozlemHipotezEylem
Persist error >0, tum charger'larDB connectivity veya schema-model drift§Senaryo 1 / §Senaryo 2
Persist error >0, tek chargerCharger'a ozel payload anomalyLogs deep dive
Persist conflict >0, single chargerVendor replay / NTP§Senaryo 4
Persist conflict >0, tum charger'larMigration 0042 gerilemis veya partial uygulanmamış§Senaryo 2
Broadcast error >0Redis pubsub veya WS Hub§Senaryo 3
Status=partial spike + error 0Persist katmani patladi ama outer try yedi§Senaryo 1 (persist deep dive)

3. Eskalasyon Matrisi

SeviyeKimNe zamanIletisim
L1Backend on-callHemenSlack #zeus-incidents + PagerDuty
L2backend-systems-architectL1 30dk icinde cozemezseSlack DM
L3CTOL2 1sa icinde cozemezse veya etki >100 chargerTelefon
P4 (info)Vendor entegrasyon ekibiconflict spike + tek chargerEmail ticket

P1 critical (OcppMeterValuesPersistErrorRateHigh):

  • 30dk icinde cozulmezse -> CTO escalation.
  • Eger schema-model drift (Senaryo 1) ise: devops-deployment-agent ile koordineli alembic upgrade head planla.

P2/P3 warning:

  • Backend on-call ticket -> 4 saatte triage.

4. Sik Gorulen Senaryolar

Senaryo 1: Migration kosmadi, schema-model drift

Belirti:

  • OcppMeterValuesPersistErrorRateHigh firing
  • Logs: asyncpg.NotNullViolationError: null value in column "phase" veya benzer schema mismatch
  • Persist Status Rate panel'inde error baskin, conflict neredeyse 0

Tani:

# Alembic current revision
docker compose exec backend alembic current
# Beklenen: en az 0042 (telemetry_charger_meter PK fix migration)

# Migration history
docker compose exec backend alembic history | head -10

# Manuel kontrol — telemetry_charger_meter PK
docker compose exec postgres psql -U zeus -d zeus -c "\d telemetry_charger_meter"

Cozum:

  1. Migration kosmamissa: docker compose exec backend alembic upgrade head
  2. Migration aktif ama model drift varsa: code review — app/features/charger/models/telemetry.py ile DB schema'sini karsilastir
  3. Deploy pipeline patlamis ise: devops-deployment-agent'a handoff (.github/workflows/deploy.yml line 553 civari migration step)

Onleme: CI'da migration smoke test (alembic upgrade head cleanroom DB) zorunlu — PR'da yokken birlestirme.


Senaryo 2: Vendor 3-fazli payload, eski PK

Belirti:

  • OcppMeterValuesPersistConflictSpike (P4) firing
  • Logs: asyncpg.UniqueViolationError on (time, charger_id, measurand) (eski 3-tuple PK belirten)
  • Sadece 3-fazli charger'larda

Tani:

# Migration 0042 uygulandi mi?
docker compose exec postgres psql -U zeus -d zeus -c "
SELECT constraint_name, column_name
FROM information_schema.key_column_usage
WHERE table_name = 'telemetry_charger_meter'
ORDER BY ordinal_position;
"
# Beklenen: time, charger_id, measurand, phase (4 kolon)
# Eski: time, charger_id, measurand (3 kolon) -> Migration eksik

# Charger model bazinda conflict count
docker compose logs backend --since 1h 2>&1 \
| grep ocpp_meter_values_persist_conflict \
| jq -r '.charger_id' | sort | uniq -c

Cozum:

  1. Migration 0042 uygulanmamissa: docker compose exec backend alembic upgrade head
  2. Migration uygulanmis ama PK hala 3-tuple: migration-integrity-sentinel handoff — migration corruption olabilir
  3. Cozum sonrasi 30dk gozlem -> conflict rate 0'a inmeli

Senaryo 3: Broadcast Redis bagimlilik problemi

Belirti:

  • OcppMeterValuesBroadcastError (P3) firing
  • Logs: ocpp_event_meter_publish_failed + RedisConnectionError / ConnectionResetError
  • Frontend LiveMeterChart UI canli veri gostermez (kullanici sikayeti)
  • Persist tarafi saglikli (panel 5'te success baskin, partial/error minimal)

Tani:

# Redis health
docker compose exec redis redis-cli PING
# Beklenen: PONG

# Redis pubsub channel sayisi
docker compose exec redis redis-cli PUBSUB CHANNELS "ocpp:*" | wc -l

# Redis connection sayisi
docker compose exec redis redis-cli INFO clients

# WS Hub backend log
docker compose logs backend --since 15m 2>&1 | grep -E "ws_hub|websocket" | tail -50

Cozum:

  1. Redis kapali / unreachable: docker compose restart redis (DevOps handoff)
  2. Connection pool exhaustion: backend connection limit kontrolu (backend/app/core/redis/client.py)
  3. WS Hub crash: backend restart degerlendir (docker compose restart backend); kullanici impact yok ama persist tarafi etkilenmez
  4. Cozum sonrasi 5dk gozlem -> broadcast error rate 0'a inmeli

Senaryo 4: PR-B incident pattern (REFERANS)

Belirti (2026-05-11 production):

  • ocpp_meter_values_failed log'u her saniye dustu (11+ dakika)
  • HICBIR alert tetiklenmedi (eski metric setiyle MeterValues_persist/broadcast yoktu, sadece MeterValues success/error vardi)
  • Frontend LiveMeterChart UI olu (broadcast da etkilendi cunku ayni try'daydi)
  • Saha mudahalesi kullanici sikayetinden sonra (~12 dakika sonra)

Root cause:

  • Migration 0041 oncesi: PK (time, charger_id, measurand) 3-tuple
  • Beny 3-fazli charger MeterValues payload'inda ayni measurand'i 3 farkli phase ile gonderdi
  • IntegrityError -> outer try except Exception yutuyordu -> _inc_metric("MeterValues", "in", "error") artirildı
  • AMA o tarihte MeterValues{status="error"} icin alert yoktu (ocpp.yml icinde sadece auth/timeout/heartbeat alarmlari vardi)

Cozum (uygulandi):

  1. Migration 0042 (b8c474b): PK 4-tuple — conflict ortadan kalkti
  2. Handler refactor (6b3a022): persist + broadcast ayri try/except, yeni metric semantigi
  3. Alert kurallari (bu PR): 4 seviyeli alarm — bu pattern sessiz kalamaz

Tekrar olmamasi icin onleme:

  • Yeni metric ekleyen her PR mutlaka alert rules + dashboard ile birlikte gelmeli (Code Audit Sentinel cardinality + alert coverage kontrolu yapmali)
  • Production canary: PR-B sonrasi 24-48h boyunca OcppMeterValuesPersistErrorRateHigh 0 olmali

5. Postmortem Kontrol Listesi

Eger P1 critical alert >30dk firing kaldıysa:

  • Incident timeline (alert firing, first response, fix, recovery)
  • Etki: kac charger, ne kadar veri kaybi (telemetry_charger_meter row count delta)
  • Root cause yazimi (5 whys)
  • Aksiyon maddeleri:
    • Yeni alert eklenmesi gerekiyor mu? (bu runbook kapsam disinda kaldıysa)
    • Migration / handler kodu degisikligi mi gerekti?
    • Cardinality / log volume problem mi yarattı?
  • Postmortem dokumani: docs/incidents/<YYYY-MM-DD>-meter-values-incident.md
  • CHANGELOG entry (api-documentation-architect handoff)

6. Ilgili Dokumanlar

  • Handler refactor: backend/app/core/ocpp/v16/handlers/core.py::on_meter_values (commit 6b3a022)
  • Migration: backend/alembic/versions/0042_*.py (commit b8c474b)
  • Tests: backend/tests/features/ocpp/test_meter_values_*.py (commit ed2440e)
  • Metric tanimlari: backend/app/core/observability/ocpp_metrics.py
  • Alert kurallari: prometheus/rules/ocpp_meter_values.yml
  • Dashboard: grafana/dashboards/ocpp-meter-values.json
  • Structured logging standardi: docs/observability/structured-logging.md (ozellikle ocpp_meter_values_* event'leri)
  • Diger OCPP runbook'lari: PR-E6 Dispatcher Canary