🎯 Phase 5 - HA/DR/SRE Specialization

System Architecture Cheatsheet | High Availability • Disaster Recovery • Site Reliability Engineering

📊 SLA / SLO / SLI

Definitions

Common SLIs

Availability = (Uptime / Total Time) × 100

🎯 Availability Levels

Availability | Name | Downtime/year
99% | Two 9s | 3.65 days
99.9% | Three 9s | 8.76 hours
99.95% | Three and a half 9s | 4.38 hours
99.99% | Four 9s | 52.6 minutes
99.999% | Five 9s | 5.26 minutes
💡 Each additional "9" allows 10x less downtime and is roughly 10x harder to achieve
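
The downtime figures in the table follow directly from the availability formula. A minimal sketch, assuming a 365-day year; the function names are illustrative and not part of the course material:

```python
# Sketch: allowed downtime per year for a given availability target.
# Assumes a 365-day year; function names are illustrative.
HOURS_PER_YEAR = 365 * 24

def availability_pct(uptime_hours: float, total_hours: float) -> float:
    """Availability = (Uptime / Total Time) x 100."""
    return uptime_hours / total_hours * 100

def downtime_hours_per_year(availability: float) -> float:
    """Hours of downtime per year permitted by an availability percentage."""
    return (100 - availability) / 100 * HOURS_PER_YEAR

for target in (99, 99.9, 99.95, 99.99, 99.999):
    print(f"{target}% -> {downtime_hours_per_year(target):.3f} h/year")
```

Dividing the 99% result by 24 gives the 3.65 days in the table; multiplying the 99.99% result by 60 gives roughly 52.6 minutes.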

💰 Error Budget

Error Budget = 100% - SLO

Example: SLO 99.9%
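
With a 99.9% SLO the error budget is 0.1%. The sketch below converts that into minutes; the 30-day window and the helper names are assumptions made for illustration, so substitute the window your SLO actually uses:

```python
# Sketch: turn an SLO into an error budget expressed in minutes.
# The 30-day window is an assumption; use the window your SLO defines.
MINUTES_PER_DAY = 24 * 60

def error_budget_minutes(slo_pct: float, window_days: int = 30) -> float:
    """Error Budget = 100% - SLO, converted to minutes over the window."""
    budget_fraction = (100 - slo_pct) / 100
    return budget_fraction * window_days * MINUTES_PER_DAY

def budget_remaining_pct(slo_pct: float, downtime_minutes: float,
                         window_days: int = 30) -> float:
    """Share of the error budget still unspent, as a percentage."""
    budget = error_budget_minutes(slo_pct, window_days)
    return (1 - downtime_minutes / budget) * 100

budget = error_budget_minutes(99.9)
print(f"SLO 99.9% over 30 days: {budget:.1f} minutes of error budget")
print(f"After 10.8 minutes of downtime: {budget_remaining_pct(99.9, 10.8):.0f}% left")
```

Under these assumptions the budget comes to about 43 minutes per 30 days.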

Usage Rules

⚠️ Never deliberately exceed the error budget

🔄 High Availability Patterns

Redundancy

Redundancy Levels

🎯 Rule: Eliminate every SPOF (Single Point of Failure)
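
To see why a SPOF matters, it can help to compare a chain of required components with a set of redundant replicas. This is a rough sketch that assumes failures are independent, which real systems only approximate; the function names are illustrative:

```python
import math

def parallel_availability(a: float, n: int) -> float:
    """Availability of n redundant replicas, each with availability a (0..1).

    Assumes independent failures: the group is down only if all n are down.
    """
    return 1 - (1 - a) ** n

def serial_availability(components: list[float]) -> float:
    """Availability of a chain in which every component is required."""
    return math.prod(components)

# Two 99% replicas behind a failover mechanism (itself assumed perfect here).
print(f"2 x 99% in parallel: {parallel_availability(0.99, 2):.4%}")
# Three required components in series: the weakest link dominates.
print(f"99.9% * 99.9% * 99% in series: {serial_availability([0.999, 0.999, 0.99]):.4%}")
```

In practice the failover mechanism and shared dependencies (network, power, deployment pipeline) reduce the gain, so treat the parallel figure as an upper bound.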

⚡ Failover Strategies

Type | RTO | Cost
Cold Standby | Hours | $
Warm Standby | Minutes | $$
Hot Standby | Seconds | $$$
Active-Active | ~0 | $$$$

Key Metrics

🆘 Disaster Recovery

DR Strategies

Backup Types

3-2-1 Rule: 3 copies, 2 media types, 1 offsite
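
The 3-2-1 rule is easy to check mechanically against a backup inventory. A minimal sketch, assuming the inventory lists every copy of the data (the primary included); the `BackupCopy` type and its fields are hypothetical:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BackupCopy:
    """One copy of the data (hypothetical inventory record)."""
    name: str
    media: str      # e.g. "disk", "tape", "object-storage"
    offsite: bool   # stored outside the primary site

def satisfies_3_2_1(copies: list[BackupCopy]) -> bool:
    """True if there are >= 3 copies, on >= 2 media types, with >= 1 offsite."""
    enough_copies = len(copies) >= 3
    enough_media = len({c.media for c in copies}) >= 2
    has_offsite = any(c.offsite for c in copies)
    return enough_copies and enough_media and has_offsite

inventory = [
    BackupCopy("primary", "disk", offsite=False),
    BackupCopy("nightly-snapshot", "disk", offsite=False),
    BackupCopy("weekly-archive", "object-storage", offsite=True),
]
print(satisfies_3_2_1(inventory))
```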

🔥 Chaos Engineering

Principles

Experiment Types

Tools

Chaos Monkey | Gremlin | LitmusChaos

📈 Performance Metrics

Latency Percentiles

Throughput

🎯 Always monitor P99, not just the average

🧪 Load Testing

Test Types

Tools
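
A small numeric example shows how the average can hide tail latency. This sketch uses the nearest-rank percentile definition (one of several in use) and made-up sample data:

```python
import math
from statistics import mean

def percentile(samples: list[float], p: int) -> float:
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Illustrative data: 98 fast requests and 2 very slow ones (milliseconds).
latencies_ms = [20] * 98 + [2000] * 2

print("mean:", mean(latencies_ms))
print("P50: ", percentile(latencies_ms, 50))
print("P99: ", percentile(latencies_ms, 99))
```

Here the mean stays under 60 ms and the median is 20 ms, while P99 lands on the 2000 ms outliers that 2% of users actually experience.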

k6 | JMeter | Gatling | Locust

k6 Example

// Run with: k6 run --vus 100 --duration 30s script.js
// (script.js is a placeholder file name)
import http from 'k6/http';

export default function () {
  http.get('https://api.example.com');
}

🛠️ SRE Principles

The 7 Principles

  • Operations is a software problem
  • SLOs and Error Budgets
  • Eliminate toil (repetitive work)
  • Automate everything
  • Measure everything
  • Blameless post-mortems
  • Sustainable on-call

SRE Metrics

  • Toil: < 50% of working time
  • Error Budget: consumption rate
  • MTTR: mean time to resolution
  • Change Failure Rate: % of changes that fail
  • Deployment Frequency: how often changes are deployed
Toil = Manual + Repetitive + Automatable + Tactical
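
The metrics above reduce to simple ratios. A rough sketch with illustrative function names and inputs; the burn-rate definition (observed error ratio divided by the budget ratio) is the common one, stated here as an assumption about how consumption rate is measured:

```python
from statistics import mean

def mttr_minutes(resolution_minutes: list[float]) -> float:
    """Mean time to resolution across a set of incidents."""
    return mean(resolution_minutes)

def change_failure_rate_pct(failed_changes: int, total_changes: int) -> float:
    """Percentage of changes that caused a failure."""
    return failed_changes / total_changes * 100

def burn_rate(observed_error_ratio: float, slo_pct: float) -> float:
    """Error budget consumption rate.

    1.0 means the budget is used up exactly at the end of the SLO window;
    higher values mean it runs out sooner.
    """
    return observed_error_ratio / ((100 - slo_pct) / 100)

def toil_within_cap(toil_hours: float, total_hours: float, cap_pct: float = 50.0) -> bool:
    """True if toil stays under the cap (50% by default, as listed above)."""
    return toil_hours / total_hours * 100 < cap_pct

print(mttr_minutes([30, 60, 90]))        # three incidents, in minutes
print(change_failure_rate_pct(3, 20))    # 3 failed changes out of 20
print(burn_rate(0.005, 99.9))            # 0.5% errors against a 99.9% SLO
print(toil_within_cap(15, 40))           # 15 toil hours in a 40-hour week
```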

🚨 Incident Management

Severity Levels

Sev | Impact | Response time
SEV1 | Critical, all users | 15 min
SEV2 | Major, many users | 30 min
SEV3 | Minor, some users | 4 h
SEV4 | Low, cosmetic | 24 h
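
The response targets in the table can be encoded as data so that tooling computes deadlines consistently. A minimal sketch; the dictionary and function names are illustrative:

```python
from datetime import datetime, timedelta

# Response targets taken from the severity table above.
RESPONSE_TARGETS = {
    "SEV1": timedelta(minutes=15),
    "SEV2": timedelta(minutes=30),
    "SEV3": timedelta(hours=4),
    "SEV4": timedelta(hours=24),
}

def response_deadline(severity: str, opened_at: datetime) -> datetime:
    """Latest time by which someone should have responded to the incident."""
    return opened_at + RESPONSE_TARGETS[severity]

def is_breached(severity: str, opened_at: datetime, now: datetime) -> bool:
    """True if the response target has already been missed."""
    return now > response_deadline(severity, opened_at)

opened = datetime(2026, 1, 1, 12, 0)
print(response_deadline("SEV1", opened))
print(is_breached("SEV2", opened, datetime(2026, 1, 1, 12, 45)))
```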

Incident Roles

📝 Blameless Post-Mortem

Structure

💡 Focus on SYSTEMS, not people

5 Whys Technique

Why? → Why? → Why? → Why? → Why? → Root Cause

🔄 CI/CD Pipeline

CI Stages

Commit → Build → Test → Scan → Package

CD Stages

Deploy Dev → Test → Stage → Prod

DORA Metrics

Metric | Elite | High
Deploy Freq | Multiple/day | 1/week
Lead Time | < 1 hour | < 1 week
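
Lead time is the delay between a commit and its deployment to production. The sketch below classifies a measured lead time against the two thresholds in the table; tiers below "High" are not listed in this cheatsheet, so they are lumped together, and the function names are illustrative:

```python
from datetime import datetime, timedelta

def lead_time(committed_at: datetime, deployed_at: datetime) -> timedelta:
    """Time from commit to running in production."""
    return deployed_at - committed_at

def lead_time_tier(value: timedelta) -> str:
    """Classify using only the thresholds given in the table above."""
    if value < timedelta(hours=1):
        return "Elite"
    if value < timedelta(weeks=1):
        return "High"
    return "below High"  # lower tiers are not covered by this cheatsheet

commit = datetime(2026, 1, 5, 9, 0)
print(lead_time_tier(lead_time(commit, datetime(2026, 1, 5, 9, 40))))
print(lead_time_tier(lead_time(commit, datetime(2026, 1, 8, 9, 0))))
```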

📦 GitOps

Principles

Workflow

PR → Review → Merge → Auto-Deploy

Tools

💡 "If it's not in Git, it doesn't exist"

🚀 Deployment Strategies

Strategy | Risk | Rollback
Recreate | High | Slow
Rolling | Medium | Medium
Blue/Green | Low | Instant
Canary | Very low | Instant

Canary Deployment

1% → 5% → 25% → 50% → 100% (if metrics are OK at each step)
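
The progression above amounts to a loop with a metrics gate at every step. A minimal sketch of that control flow only: `metrics_ok` is a hypothetical callback standing in for a real health check, and the actual traffic shifting is left out:

```python
from typing import Callable, Sequence

CANARY_STEPS = (1, 5, 25, 50, 100)  # traffic percentages from the line above

def run_canary(metrics_ok: Callable[[int], bool],
               steps: Sequence[int] = CANARY_STEPS) -> tuple[int, list[int]]:
    """Advance through the steps while metrics stay healthy.

    Returns (final traffic %, steps completed). A failed check rolls back to 0%.
    """
    completed: list[int] = []
    for pct in steps:
        # A real rollout would shift pct% of traffic here and wait before checking.
        if not metrics_ok(pct):
            return 0, completed  # roll back: all traffic returns to the stable version
        completed.append(pct)
    return steps[-1], completed

print(run_canary(lambda pct: True))       # healthy at every step
print(run_canary(lambda pct: pct < 25))   # metrics degrade at 25%
```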

👁️ Observability

The 3 Pillars

Tools

Pillar | Tools
Logs | ELK, Loki, Splunk
Metrics | Prometheus, Datadog
Traces | Jaeger, Zipkin

Golden Signals

Latency | Traffic | Errors | Saturation

🏗️ Platform Engineering

Internal Developer Platform

IDP Components

Platform = Enable Devs + Reduce Cognitive Load

⌨️ Essential Commands

Kubernetes

kubectl get pods -A                      # All pods, all namespaces
kubectl describe pod <pod>               # Pod details
kubectl logs -f <pod>                    # Stream logs in real time
kubectl rollout restart deploy/<name>    # Restart a deployment
kubectl rollout undo deploy/<name>       # Roll back to the previous revision
kubectl top pods                         # Resource usage (requires metrics-server)

Monitoring

# Prometheus queries (PromQL)
rate(http_requests_total[5m])            # Requests/sec
histogram_quantile(0.99, rate(...))      # P99 latency
sum by (status)(http_requests_total)     # Group by status

# Alertmanager
amtool alert query                       # Active alerts

System Architect Training - Phase 5 Specialization | v1.0 | 2026