OCPP Auth List Push Silent-Fail

Symptoms

  • The Operator/CSO panel reports "RFID card sent to device", but field feedback says "the card is not on the device" / "I cannot log in from the screen"
  • The OcppAuthListPushEmptyTargets or OcppAuthListPushHighExceptionRate alarm is active on the dashboard
  • The OcppCommandLog table does not contain the expected number of action='SendLocalList' records for the last hour
  • After the tenant-wide push button is pressed, the response shows dispatched_count=0 together with warning='no_chargers_resolved' or tenant_filter_dropped_N

Possible Causes

  1. UI bug: the frontend submits with an empty charger_ids (target=specific but the list is empty)
  2. Tenant filter drift: the user selected UUIDs outside the tenant and the backend dropped them silently (tenant_filter_dropped_N warning)
  3. Concurrent push (CAS drift): the old admin panel and the new panel push to the same tenant at the same time; the CAS update affects 0 rows and logs drift
  4. Chargers offline in bulk: there is no owner in the registry, so the push lands in the offline counter without reaching any charger (not a silent fail, but the operator assumed it "arrived")
  5. Dispatcher / pubsub backlog: the _send_ocpp_command wrapper returns a timeout; an OcppCommandLog record is written with log_status='timeout'
  6. Backend exception: an uncaught error inside push_tenant_auth_list (DB connection, sqlalchemy session expiry, etc.)

Diagnostic Steps

1. Narrow Down the Alarm Cause

# Which metric/result combination fired the alarm:
# In Grafana, check the {mode, result} breakdown for the last hour
# in the "Auth List Push Attempts" panel.
#
# Prometheus query:
# sum by (mode, result) (rate(ocpp_auth_list_push_attempts_total[1h]))
#
# Expected:
# - result=dispatched > 0 -> the flow is healthy
# - mostly result=empty -> UI bug or tenant filter problem
# - result=exception > 0 -> uncaught backend error
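
The decision rules above can be sketched as a small helper for triage scripts. This is a minimal sketch, not part of the backend: the function name and the returned labels are hypothetical, the input is assumed to be per-result totals already summed across modes, and "mostly" is assumed to mean more than half of all attempts.

```python
from typing import Mapping


def diagnose_attempts(counts: Mapping[str, float]) -> str:
    """Map per-result attempt totals to the likely cause listed in this step.

    `counts` is keyed by the `result` label (dispatched, empty, exception).
    The returned labels are illustrative, not values emitted by the backend.
    """
    total = sum(counts.values())
    if total == 0:
        return "no_attempts"
    # Any exception at all points at an uncaught backend error.
    if counts.get("exception", 0) > 0:
        return "backend_exception"
    # Assumption: "mostly empty" means more than half of all attempts.
    if counts.get("empty", 0) / total > 0.5:
        return "ui_bug_or_tenant_filter"
    if counts.get("dispatched", 0) > 0:
        return "healthy"
    return "unknown"
```

For example, `diagnose_attempts({"dispatched": 2, "empty": 8})` points at the UI bug or tenant filter branch.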

2. Query the OcppCommandLog Audit Records

-- SendLocalList attempts for the tenant in the last hour (audit evidence).
SELECT
    id,
    charger_id,
    user_id,
    log_status,        -- accepted | rejected | timeout | error
    error_code,
    error_description,
    latency_ms,
    created_at
FROM ocpp_command_log
WHERE tenant_id = '<TENANT_UUID>'
  AND action = 'SendLocalList'
  AND created_at > now() - interval '1 hour'
ORDER BY created_at DESC
LIMIT 50;
  • No records at all -> the wrapper was never called (definitely a silent fail: empty target or schema validation error).
  • Only log_status='timeout' -> dispatcher/charger reply lag; check the ocpp_command_timeouts_total{timeout_kind} breakdown.
  • Mostly log_status='rejected' -> the charger is rejecting SendLocalList (possibly a firmware problem or listVersion drift).
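
The three readings above can be applied mechanically to the log_status column of the query result. The following is a sketch under stated assumptions: the function name and return labels are hypothetical, and "mostly rejected" is assumed to mean more than half of the rows.

```python
from collections import Counter
from typing import Iterable


def interpret_command_log(statuses: Iterable[str]) -> str:
    """Classify the log_status values returned by the audit query.

    Return labels are illustrative names for the three cases in this step.
    """
    counts = Counter(statuses)
    total = sum(counts.values())
    if total == 0:
        # No rows: the wrapper was never called.
        return "wrapper_never_called"
    if counts["timeout"] == total:
        # Every attempt timed out: dispatcher or charger reply lag.
        return "dispatcher_or_charger_reply_lag"
    # Assumption: "mostly rejected" means more than half of the rows.
    if counts["rejected"] / total > 0.5:
        return "charger_rejecting"
    return "no_clear_pattern"
```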

3. Check the AuditLog Tenant Push Record

-- push_tenant_auth_list writes one audit record per call; the payload
-- holds a summary of
-- requested/resolved/dispatched/accepted/rejected/offline/error/warning.
SELECT
    id,
    actor_user_id,
    payload,
    created_at
FROM audit_log
WHERE tenant_id = '<TENANT_UUID>'
  AND action = 'ocpp.auth_list.tenant_pushed'
  AND created_at > now() - interval '1 hour'
ORDER BY created_at DESC
LIMIT 20;
  • payload->>'requested_count' > 0 AND payload->>'resolved_count' = 0 -> tenant filter drop. The consumer (frontend) sent UUIDs outside the tenant.
  • payload->>'resolved_count' > 0 AND payload->>'dispatched_count' = 0 -> all chargers are offline. Check the registry.
  • payload->>'cas_drift_count' > 0 -> there is a concurrent push; the old admin panel may still be issuing calls.
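
When many audit rows have to be checked, the payload rules above can be applied in code. This is a minimal sketch that assumes the payload has already been loaded as a dict with the count keys named in the bullets; the function name and finding labels are hypothetical. Values are passed through int() because `->>` returns text.

```python
from typing import Any, Mapping


def classify_push_audit(payload: Mapping[str, Any]) -> list[str]:
    """Return the findings from this step that apply to one audit payload."""

    def count(key: str) -> int:
        # Missing or null counters are treated as zero; text values are cast.
        return int(payload.get(key) or 0)

    findings = []
    if count("requested_count") > 0 and count("resolved_count") == 0:
        findings.append("tenant_filter_drop")
    if count("resolved_count") > 0 and count("dispatched_count") == 0:
        findings.append("all_chargers_offline")
    if count("cas_drift_count") > 0:
        findings.append("concurrent_push")
    return findings
```

An empty list means none of the three patterns matched, so the next step is the per-charger outcome breakdown.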

4. Per-Charger Outcome Distribution

# Grafana "Auth List Push Per-Charger Outcomes" panel (if it does not exist,
# run the Prometheus query manually):
# sum by (outcome) (rate(ocpp_auth_list_push_chargers_total[1h]))
#
# Expected typical distribution:
# accepted > 80%, offline 10-15%, rejected < 5%, timeout/error < 2%
#
# error+timeout > 20% -> dispatcher/Redis pubsub backlog or a charger
# firmware reply problem.
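
The thresholds above can be checked against raw per-outcome counts. This is a sketch only: the function name and flag names are hypothetical, and treating any deviation from the "typical" accepted and rejected shares as a flag is an assumption; the source states only the 20% error+timeout rule as an explicit trigger.

```python
from typing import Mapping


def flag_outcome_mix(counts: Mapping[str, float]) -> list[str]:
    """Compare per-outcome counts with the typical distribution in this step."""
    total = sum(counts.values())
    if total == 0:
        return ["no_data"]

    def share(outcome: str) -> float:
        return counts.get(outcome, 0) / total

    flags = []
    # Explicit rule from the runbook: error + timeout above 20%.
    if share("error") + share("timeout") > 0.20:
        flags.append("error_timeout_over_20pct")
    # Assumption: falling outside the typical ranges is worth flagging.
    if share("accepted") <= 0.80:
        flags.append("accepted_below_typical")
    if share("rejected") >= 0.05:
        flags.append("rejected_above_typical")
    return flags
```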

5. Frontend Payload Inspection (Empty Pattern)

When a tenant push is triggered from the UI, open the corresponding request in the Network tab of the browser devtools:

  • Endpoint: POST /api/v1/ocpp/auth-list/push
  • Payload:
    {
    "target": "specific",
    "charger_ids": ["uuid1", "uuid2", ...]
    }
  • target="specific" AND charger_ids is an empty array -> frontend bug (selection state lost). Open an urgent ticket with the UI team.
  • target="all" AND backend resolved_count=0 -> either the tenant has no chargers (new tenant) or the tenant_id filter in the DB query on the OcppCharger table is wrong (backend regression).

6. Scan the CAS Drift Warning Logs

# Backend structlog query: ocpp_auth_list_cas_drift warnings in the last hour:
docker compose logs backend --since 1h | grep ocpp_auth_list_cas_drift | tail -20

# Sample output:
# {"event":"ocpp_auth_list_cas_drift","charger_id":"...","candidate_version":42,"expected_current":41,...}
#
# Many drift events -> pushes are running concurrently. Query the active
# admin panel sessions and the user_ids that submitted in the last 5 minutes.
SELECT actor_user_id, created_at
FROM audit_log
WHERE action = 'ocpp.auth_list.tenant_pushed'
  AND tenant_id = '<TENANT_UUID>'
  AND created_at > now() - interval '5 minutes'
ORDER BY created_at DESC;
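
To see which chargers drift most, the JSON log lines can be tallied per charger_id. This sketch assumes one JSON object per line in the shape of the sample output, possibly preceded by the container-name prefix that docker compose logs adds; the function name is hypothetical.

```python
import json
from collections import Counter
from typing import Iterable


def count_cas_drift(lines: Iterable[str]) -> Counter:
    """Count ocpp_auth_list_cas_drift events per charger_id."""
    per_charger: Counter = Counter()
    for line in lines:
        # Skip any prefix before the JSON object (e.g. the container name).
        start = line.find("{")
        if start == -1:
            continue
        try:
            record = json.loads(line[start:])
        except json.JSONDecodeError:
            continue  # not a structured log line
        if record.get("event") == "ocpp_auth_list_cas_drift":
            per_charger[record.get("charger_id", "unknown")] += 1
    return per_charger
```

Piping the docker compose output into a script that passes `sys.stdin` to this function and prints `most_common(10)` gives a quick ranking.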

7. Backend Exception Trace (Exception Pattern)

# After the F4 refactor, the outer try/finally logs the exception path:
docker compose logs backend --since 30m \
  | grep -E "push_tenant_auth_list|ocpp_auth_list_push_charger_failed" \
  | tail -50

# Typically:
# - sqlalchemy.exc.OperationalError -> DB connection pool exhausted
# - sqlalchemy.exc.InvalidRequestError -> session expiry race
# - asyncpg.exceptions.* -> DB protocol error

Mitigation

Temporary (can be applied within 5 minutes)

  1. Empty target pattern -> Tell the operator to select chargers one by one in the new panel and push them (postpone the tenant-wide submit); file a bug report with the frontend team.

  2. Backend exception pattern -> docker compose restart backend. This does not create CAS drift because the wrapper makes its own commits; after the restart the next push starts clean.

  3. Mostly dispatcher timeouts -> If the PR-E6 fix is not live, run docker compose restart backend; if the PR-E6 canary is active, cross-check the OcppDispatcherLoopErrors alarms (runbook: pre6-canary-monitoring).

Permanent

  1. Frontend empty submit guard: prevent the UI from sending a request with an empty charger_ids (button disabled state).
  2. Backend rate limiting: at most 2 pushes per 60 seconds for the same tenant (reduces concurrent CAS drift).
  3. If CAS drift > 5%: consider adding a pessimistic lock (transaction-level advisory lock) inside push_tenant_auth_list.
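
The rate limit in item 2 could look roughly like the sliding-window limiter below. This is a sketch, not the backend's implementation: the class name is hypothetical, and the state lives in process memory, so a deployment with several backend workers would need a shared store instead.

```python
import time
from collections import defaultdict, deque
from typing import Callable


class TenantPushLimiter:
    """Allow at most `max_pushes` per tenant within a sliding time window."""

    def __init__(
        self,
        max_pushes: int = 2,
        window_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._max = max_pushes
        self._window = window_seconds
        self._clock = clock  # injectable so tests can control time
        self._recent: dict = defaultdict(deque)

    def allow(self, tenant_id: str) -> bool:
        """Record a push and return True, or return False if over the limit."""
        now = self._clock()
        recent = self._recent[tenant_id]
        # Drop timestamps that have left the window.
        while recent and now - recent[0] >= self._window:
            recent.popleft()
        if len(recent) >= self._max:
            return False
        recent.append(now)
        return True
```

A rejected call would typically map to an HTTP 429 response so that the panel can tell the operator to retry later; that mapping is likewise an assumption.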

Rollback Plan

Effect of the F4 PR (push_tenant_auth_list wrapper refactor):

  • No DB migration -> easy to revert.
  • After a revert the _send_ocpp_command call disappears and the old direct send_command logic returns -> the silent-fail risk comes back.
  • Consider a revert only for an urgent regression (e.g. all pushes are returning exceptions). First identify the real error pattern from the OcppCommandLog audit; the F4 refactor is structurally necessary.

Commands:

# Revert (emergency only):
git revert <F4-commit-sha>
git push origin main
# deploy.yml is triggered automatically once CI goes green.

Dashboard

Panels recommended for the Grafana OCPP Charger Fleet dashboard (NOT yet in the existing JSON; to be added manually):

  • Auth List Push Attempts (rate/min) -> sum by (mode, result) (rate(ocpp_auth_list_push_attempts_total[5m])) * 60
  • Auth List Push Per-Charger Outcomes -> sum by (outcome) (rate(ocpp_auth_list_push_chargers_total[5m]))
  • Auth List Push Duration p50/p95/p99 -> histogram_quantile(0.95, sum by (le, mode) (rate(ocpp_auth_list_push_duration_seconds_bucket[5m]))) (p95 shown; use 0.50 and 0.99 for the other two quantiles)

To add them: in grafana/dashboards/ocpp-fleet.json, add the new panels with IDs 16-18 as three side-by-side panels of width 6, starting at gridPos y=47.

Escalation

Escalate to the Backend team in the following cases:

  • OcppAuthListPushHighExceptionRate persists for 15+ minutes
  • The CAS drift count exceeds 50 per hour (concurrent push storm)
  • The error_code distribution for OcppCommandLog log_status='error' shows an unknown code (anything other than dispatcher_error, charger_not_found)
  • A rollback is being considered (approval from the PR owner + on-call backend + DevOps is required)