SLA / SLO / SLI
Definitions
- SLI = Service Level Indicator (the measurement)
- SLO = Service Level Objective (the target)
- SLA = Service Level Agreement (the contract)
Common SLIs
- Availability: % uptime
- Latency: P50, P95, P99
- Throughput: req/sec
- Error Rate: % of requests that fail
Availability (%) = (Uptime / Total Time) × 100
Availability Levels
| Availability | Name | Downtime/year |
| 99% | Two 9s | 3.65 days |
| 99.9% | Three 9s | 8.76 hours |
| 99.95% | Three and a half 9s | 4.38 hours |
| 99.99% | Four 9s | 52.6 minutes |
| 99.999% | Five 9s | 5.26 minutes |
Rule of thumb: each additional "9" leaves 10x less downtime and is roughly 10x harder to achieve
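The downtime figures for each availability level follow directly from the availability formula. A minimal sketch, assuming a 365-day year (the function name `allowed_downtime_per_year` is illustrative, not from the source):

```python
from datetime import timedelta

HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year, which matches the usual published figures


def allowed_downtime_per_year(availability_percent: float) -> timedelta:
    """Return the downtime per year that a given availability level permits."""
    unavailable_fraction = 1 - availability_percent / 100
    return timedelta(hours=HOURS_PER_YEAR * unavailable_fraction)


if __name__ == "__main__":
    for level in (99.0, 99.9, 99.95, 99.99, 99.999):
        minutes = allowed_downtime_per_year(level).total_seconds() / 60
        print(f"{level}% allows about {minutes:.1f} minutes of downtime per year")
```

Changing `HOURS_PER_YEAR` to a monthly or quarterly window gives the budget for shorter reporting periods.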
Error Budget
Error Budget = 100% - SLO
Example: SLO of 99.9%
- Error Budget = 0.1%
- = 43.8 min/month of downtime
- = 8.76 h/year allowed
Usage Rules (by share of budget remaining)
- Budget > 50%: aggressive deployments
- Budget 20-50%: normal deployments
- Budget < 20%: maximum caution
- Budget exhausted: STOP deployments
Never deliberately exceed the error budget
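The budget arithmetic and the deployment thresholds can be sketched in a few lines. This is an illustration only: it assumes an average month of 365/12 days (which is what yields 43.8 minutes for a 99.9% SLO), and all function names are hypothetical.

```python
MINUTES_PER_MONTH = 365 * 24 * 60 / 12  # assumption: average month, 43,800 minutes


def error_budget_minutes(slo_percent: float, window_minutes: float = MINUTES_PER_MONTH) -> float:
    """Minutes of downtime the SLO tolerates over the window (Error Budget = 100% - SLO)."""
    return window_minutes * (100 - slo_percent) / 100


def remaining_budget_fraction(slo_percent: float, downtime_minutes: float,
                              window_minutes: float = MINUTES_PER_MONTH) -> float:
    """Share of the budget still unspent; negative once the budget is overspent.

    Assumes slo_percent < 100, otherwise the budget is zero and the division fails.
    """
    budget = error_budget_minutes(slo_percent, window_minutes)
    return 1 - downtime_minutes / budget


def deployment_policy(remaining_fraction: float) -> str:
    """Map the remaining budget share onto the four usage rules."""
    if remaining_fraction <= 0:
        return "stop deployments"
    if remaining_fraction < 0.20:
        return "maximum caution"
    if remaining_fraction <= 0.50:
        return "normal deployments"
    return "aggressive deployments"
```

For example, `deployment_policy(remaining_budget_fraction(99.9, 40))` evaluates the policy after 40 minutes of downtime in the month.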
High Availability Patterns
Redundancy
- Active-Passive: 1 active, N on standby
- Active-Active: all nodes active behind a load balancer
- N+1: required capacity + 1 spare
- 2N: double the required capacity
Redundancy Levels
- Multi-Instance: same server
- Multi-Server: same rack
- Multi-AZ: same region
- Multi-Region: geographically separate
Rule: eliminate every SPOF (Single Point of Failure)
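To see why redundancy helps and why a single point of failure hurts, the standard reliability formulas can be sketched as below. They assume component failures are independent, which real systems sharing a rack, AZ, or deployment pipeline often violate; the function names are illustrative.

```python
from math import prod


def parallel_availability(component_availability: float, replicas: int) -> float:
    """Availability of N redundant replicas where any one can serve traffic.

    Assumes independent failures: the group is down only if every replica is down.
    """
    return 1 - (1 - component_availability) ** replicas


def serial_availability(*component_availabilities: float) -> float:
    """Availability of a chain where every component is required (each one is a SPOF)."""
    return prod(component_availabilities)


if __name__ == "__main__":
    print(parallel_availability(0.99, 2))  # two 99% replicas in parallel
    print(serial_availability(0.99, 0.99))  # two 99% components in series
```

Under the independence assumption, two 99% replicas in parallel reach about four 9s, while two 99% components in series fall below two 9s.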
Failover Strategies
| Type | RTO | Cost |
| Cold Standby | Hours | $ |
| Warm Standby | Minutes | $$ |
| Hot Standby | Seconds | $$$ |
| Active-Active | ~0 | $$$$ |
Key Metrics
- RTO = Recovery Time Objective (maximum acceptable time to restore service)
- RPO = Recovery Point Objective (maximum acceptable data loss, expressed as time)
- MTTR = Mean Time To Recovery
- MTBF = Mean Time Between Failures
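MTBF and MTTR combine into a steady-state availability estimate. The sketch below assumes the convention where MTBF counts only the operating time between failures (some texts call this MTTF); the function names are hypothetical.

```python
from statistics import mean


def mean_time_to_recovery(repair_hours: list[float]) -> float:
    """MTTR: average time spent restoring service per incident."""
    return mean(repair_hours)


def mean_time_between_failures(uptime_hours: list[float]) -> float:
    """MTBF: average operating time between consecutive failures."""
    return mean(uptime_hours)


def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR), assuming MTBF excludes repair time."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    mtbf = mean_time_between_failures([700.0, 1300.0, 997.0])
    mttr = mean_time_to_recovery([0.5, 1.5, 1.0])
    print(f"availability ~ {steady_state_availability(mtbf, mttr):.4%}")
```

The formula shows two levers: fail less often (raise MTBF) or recover faster (lower MTTR).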
Disaster Recovery
DR Strategies
- Backup/Restore: RPO hours, RTO days
- Pilot Light: RPO minutes, RTO hours
- Warm Standby: RPO seconds, RTO minutes
- Multi-Site: RPO ~0, RTO ~0
Backup Types
- Full: complete copy
- Incremental: changes since the last backup (of any type)
- Differential: changes since the last full backup
3-2-1 Rule: 3 copies, 2 different media, 1 offsite
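The difference between incremental and differential backups matters most at restore time. The sketch below works out which backups are needed to restore the latest state. The `(label, kind)` tuple layout and the function name are assumptions for illustration, not any backup tool's API.

```python
Backup = tuple[str, str]  # (label, kind), kind is "full", "incremental" or "differential"


def restore_chain(backups: list[Backup]) -> list[Backup]:
    """Return the backups to apply, in order, to restore the most recent state.

    `backups` must be ordered oldest first. Raises ValueError if it has no full backup.
    """
    last_full = max(i for i, (_, kind) in enumerate(backups) if kind == "full")
    chain = [backups[last_full]]
    for backup in backups[last_full + 1:]:
        kind = backup[1]
        if kind == "differential":
            # A differential holds everything since the full backup,
            # so it supersedes any earlier differentials and incrementals.
            chain = [backups[last_full], backup]
        elif kind == "incremental":
            # An incremental only holds changes since the previous backup,
            # so every one of them must be replayed.
            chain.append(backup)
    return chain


if __name__ == "__main__":
    week = [("sun", "full"), ("mon", "incremental"), ("tue", "incremental")]
    print(restore_chain(week))
```

Incrementals are smaller to take but lengthen the restore chain; differentials grow over time but need only the full backup plus the latest differential.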
Chaos Engineering
Principles
- Define the steady state (hypothesis)
- Vary real-world events
- Run experiments in production
- Automate and iterate
- Minimize the blast radius
Experiment Types
- Instance: kill servers
- Network: latency, partitions
- Application: exceptions, timeouts
- Infrastructure: AZ failure
Tools
Chaos Monkey | Gremlin | LitmusChaos
Performance Metrics
Latency Percentiles
- P50: median (50% of requests are faster)
- P95: 95% of requests are faster
- P99: realistic worst case
- P99.9: tail latency
Throughput
- RPS: Requests Per Second
- TPS: Transactions Per Second
- QPS: Queries Per Second
Always monitor P99, not just the average
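A small worked example shows why the average alone can hide tail latency. This sketch uses the nearest-rank percentile definition (monitoring systems often interpolate or estimate from histogram buckets instead), and the sample data is invented for illustration.

```python
import math
from statistics import mean


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


if __name__ == "__main__":
    # Invented data: 98 fast requests (20 ms) and 2 very slow ones (2000 ms).
    latencies_ms = [20.0] * 98 + [2000.0] * 2
    print("mean:", mean(latencies_ms))
    for p in (50, 95, 99):
        print(f"P{p}:", percentile(latencies_ms, p))
```

In this data set the median and P95 look healthy and the mean is only mildly elevated, while P99 exposes the slow requests.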
Load Testing
Test Types
- Load Test: normal expected load
- Stress Test: beyond normal limits
- Spike Test: sudden bursts
- Soak Test: sustained load over a long duration
- Breakpoint Test: ramp up until the limit is found
Tools
k6 | JMeter | Gatling | Locust
k6 Example
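import http from 'k6/http';

// Each virtual user (VU) runs this function in a loop for the test duration
export default function () {
  http.get('https://api.example.com');
}
// Run with 100 VUs for 30 seconds (script.js is an assumed file name):
// k6 run --vus 100 --duration 30s script.js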
SRE Principles
The 7 Principles
- Operations is a software problem
- SLOs and Error Budgets
- Eliminate toil (repetitive manual work)
- Automate everything
- Measure everything
- Blameless Post-mortems
- Sustainable on-call
SRE Metrics
- Toil: < 50% of time
- Error Budget: consumption rate
- MTTR: time to resolution
- Change Failure Rate: % of changes that fail
- Deployment Frequency: how often you deploy
Toil = Manual + Repetitive + Automatable + Tactical
Incident Management
Severities
| Sev | Impact | Response Time |
| SEV1 | Critical, all users | 15 min |
| SEV2 | Major, many users | 30 min |
| SEV3 | Minor, a few users | 4 h |
| SEV4 | Low, cosmetic | 24 h |
Incident Roles
- IC: Incident Commander
- Comms: Communication Lead
- Ops: Operations Lead
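The severity table translates naturally into response deadlines. A minimal sketch, assuming the response targets are measured from the moment the incident is opened (the source does not say when the clock starts); all names here are hypothetical.

```python
from datetime import datetime, timedelta

# Response targets taken from the severity table.
RESPONSE_TARGETS = {
    "SEV1": timedelta(minutes=15),
    "SEV2": timedelta(minutes=30),
    "SEV3": timedelta(hours=4),
    "SEV4": timedelta(hours=24),
}


def response_deadline(severity: str, opened_at: datetime) -> datetime:
    """Latest time by which a responder should have acknowledged the incident."""
    return opened_at + RESPONSE_TARGETS[severity]


def is_response_late(severity: str, opened_at: datetime, acknowledged_at: datetime) -> bool:
    """True if the acknowledgement came after the target for this severity."""
    return acknowledged_at > response_deadline(severity, opened_at)
```

A check like this could feed an escalation rule, for example paging a secondary responder once the deadline passes.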
Blameless Post-Mortem
Structure
- Summary: what, when, impact
- Timeline: precise chronology
- Root Cause: 5 Whys analysis
- Impact: users, revenue, SLO
- Action Items: corrective/preventive
- Lessons Learned: what we learned
Focus on SYSTEMS, not people
5 Whys Technique
Why? → Why? → Why? → Why? → Why? → Root Cause
CI/CD Pipeline
CI Stages
Commit → Build → Test → Scan → Package
CD Stages
Deploy Dev → Test → Stage → Prod
DORA Metrics
- Deployment Frequency: how often you deploy to production
- Lead Time: commit → prod
- MTTR: time to recover
- Change Failure Rate: % of deployments that cause a failure (e.g. require a rollback)
| Metric | Elite | High |
| Deploy Freq | Multiple/day | 1/week |
| Lead Time | < 1 hour | < 1 week |
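Three of the DORA metrics can be computed from a simple deployment log. The sketch below assumes each record carries a commit time, a deploy time, and a failure flag; the `Deployment` record and the function names are illustrative, not a standard schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median


@dataclass
class Deployment:
    committed_at: datetime  # when the change was committed
    deployed_at: datetime   # when it reached production
    failed: bool            # whether it caused a production failure


def deployment_frequency_per_day(deployments: list[Deployment], window_days: float) -> float:
    """Average number of production deployments per day over the window."""
    return len(deployments) / window_days


def median_lead_time(deployments: list[Deployment]) -> timedelta:
    """Median time from commit to production."""
    seconds = [(d.deployed_at - d.committed_at).total_seconds() for d in deployments]
    return timedelta(seconds=median(seconds))


def change_failure_rate(deployments: list[Deployment]) -> float:
    """Fraction of deployments that caused a failure."""
    return sum(d.failed for d in deployments) / len(deployments)
```

The median is used for lead time here so that one unusually slow change does not dominate the figure; that is a design choice of this sketch.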
GitOps
Principles
- Declarative: infrastructure as code
- Versioned: Git = source of truth
- Automated: automatic sync
- Auditable: complete history
Workflow
PR → Review → Merge → Auto-Deploy
Tools
- ArgoCD: GitOps for K8s
- Flux: GitOps toolkit
- Terraform: provider-agnostic IaC
"If it's not in Git, it doesn't exist"
Deployment Strategies
| Strategy | Risk | Rollback |
| Recreate | High | Slow |
| Rolling | Medium | Medium |
| Blue/Green | Low | Instant |
| Canary | Very low | Instant |
Canary Deployment
1% → 5% → 25% → 50% → 100%
(advance only if metrics are OK at each step)
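The canary progression can be sketched as a gated loop. The two callbacks, `set_traffic_percent` and `metrics_ok`, are hypothetical stand-ins for whatever the load balancer and monitoring stack provide. A real rollout would also wait for an observation period at each step before checking metrics.

```python
from typing import Callable, Sequence

CANARY_STEPS = (1, 5, 25, 50, 100)  # traffic percentages for the canary rollout


def run_canary(
    set_traffic_percent: Callable[[int], None],
    metrics_ok: Callable[[int], bool],
    steps: Sequence[int] = CANARY_STEPS,
) -> bool:
    """Shift traffic step by step; roll back to 0% as soon as metrics degrade.

    Returns True if the rollout reached the final step, False if it was rolled back.
    """
    for percent in steps:
        set_traffic_percent(percent)
        if not metrics_ok(percent):
            set_traffic_percent(0)  # instant rollback: send all traffic to the old version
            return False
    return True


if __name__ == "__main__":
    history: list[int] = []
    succeeded = run_canary(history.append, lambda percent: percent < 25)
    print(succeeded, history)
```

Because only a small share of traffic is exposed at the early steps, a bad release is caught while few users are affected.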
Observability
The 3 Pillars
- Logs: discrete events
- Metrics: numeric measurements over time
- Traces: the path of a request across services
Tools
| Pillar | Tools |
| Logs | ELK, Loki, Splunk |
| Metrics | Prometheus, Datadog |
| Traces | Jaeger, Zipkin |
Golden Signals
Latency | Traffic | Errors | Saturation
Platform Engineering
Internal Developer Platform (IDP)
- Self-service: autonomous developers
- Golden Paths: standard templates
- Abstractions: hide complexity
- Guardrails: security by default
IDP Components
- Service Catalog (Backstage)
- Infrastructure Templates
- CI/CD Pipelines
- Observability Stack
Platform = Enable Devs + Reduce Cognitive Load
Essential Commands
Kubernetes
kubectl get pods -A                          # All pods, all namespaces
kubectl describe pod <pod>                   # Pod details
kubectl logs -f <pod>                        # Stream logs in real time
kubectl rollout restart deploy <deployment>  # Restart a deployment
kubectl rollout undo deploy <deployment>     # Roll back to the previous revision
kubectl top pods                             # Resource usage
Monitoring
# Prometheus queries (PromQL)
rate(http_requests_total[5m])                # Requests/sec
histogram_quantile(0.99, rate(...))          # P99 latency
sum by (status)(http_requests_total)         # Group by status
# Alertmanager
amtool alert query                           # Active alerts
System Architect Training - Phase 5 Specialization | v1.0 | 2026