📚 Phase 5 Resources - Specialization

High Availability, Disaster Recovery, Performance & Site Reliability Engineering

📖 Essential Books

Site Reliability Engineering

Free Online Book

Google's foundational book on SRE practices. Covers SLOs, error budgets, toil, on-call, and automation. The definitive reference for any SRE.

Read the book →

The Site Reliability Workbook

Free Online Book

The hands-on companion to the SRE Book. Concrete exercises, case studies, and implementation details for applying SRE principles.

Read the book →

Building Secure & Reliable Systems

Free Online Book

How to build security and reliability in from the design stage. Google's practices for robust, secure systems.

Read the book →

Designing Data-Intensive Applications

Book

By Martin Kleppmann. The go-to reference on distributed systems: replication, partitioning, consistency, fault tolerance. Essential reading.

Official site →

Release It!

Book

By Michael Nygard. Designing and deploying production-ready software. Stability patterns, anti-patterns, and lessons learned.

Pragmatic Bookshelf →

Chaos Engineering

Book

By Casey Rosenthal and Nora Jones. System resiliency in practice. How to implement chaos engineering in your organization.

O'Reilly →

🔧 High Availability & DR Tools

Chaos Monkey

Netflix chaos engineering

Open Source

Gremlin

Chaos as a Service

LitmusChaos

Chaos for Kubernetes

CNCF

Chaos Toolkit

Open chaos framework

Open Source

HAProxy

HA load balancer

Open Source

Keepalived

VRRP failover

Open Source

Velero

Backup K8s clusters

CNCF

Restic

Fast secure backups

Open Source

📊 Performance & Testing Tools

k6

Modern load testing

Open Source

Gatling

Scala-based testing

Open Source

Locust

Python load testing

Open Source

Apache JMeter

Java performance

Apache

Prometheus

Metrics & alerting

CNCF

Grafana

Visualization

Open Source

Jaeger

Distributed tracing

CNCF

Loki

Log aggregation

Grafana Labs

🔄 GitOps & SRE Tools

ArgoCD

GitOps for Kubernetes

CNCF

Flux

GitOps toolkit

CNCF

Terraform

Infrastructure as Code

HashiCorp

Pulumi

IaC with real code

Open Source

PagerDuty

Incident management

Opsgenie

On-call & alerting

Backstage

Developer portal

CNCF

Port

Internal dev portal

🎬 Recommended YouTube Channels

▶️ Google Cloud Tech

SRE talks, cloud architecture, and Google best practices. The "SRE Classroom" series is excellent.

▶️ CNCF

Cloud Native Computing Foundation. KubeCon talks, CNCF projects, cloud native patterns.

▶️ HashiCorp

Terraform, Vault, Consul. Infrastructure as Code, secrets management, service mesh.

▶️ DevOps Toolkit

Viktor Farcic. GitOps, Kubernetes, Crossplane, ArgoCD. In-depth tutorials.

▶️ TechWorld with Nana

DevOps, Kubernetes, Docker, CI/CD. Excellent explanations for beginners and intermediate learners.

▶️ Rawkode Academy

Cloud native, Kubernetes, platform engineering. Streams and advanced tutorials.

▶️ That DevOps Guy

Marcel Dempers. Kubernetes deep dives, service mesh, GitOps implementations.

▶️ Fireship

100-second explainers and tech comparisons. Perfect for quickly grasping concepts.

🎓 Certification Paths

CKA - Certified Kubernetes Administrator

Linux Foundation / CNCF

  • K8s cluster administration
  • Networking, storage, security
  • Advanced troubleshooting
  • Hands-on exam (2h)
Intermediate

CKS - Certified Kubernetes Security Specialist

Linux Foundation / CNCF

  • Cluster security
  • Supply chain security
  • Runtime security
  • Prerequisite: CKA
Advanced

AWS Solutions Architect Professional

Amazon Web Services

  • Architecture HA/DR
  • Migration strategies
  • Cost optimization
  • Multi-region design
Advanced

GCP Professional Cloud Architect

Google Cloud

  • Design for reliability
  • SRE principles
  • Hybrid/multi-cloud
  • Security & compliance
Advanced

HashiCorp Terraform Associate

HashiCorp

  • IaC fundamentals
  • Terraform workflow
  • Modules and state
  • Cloud-agnostic
Foundational

GitOps Fundamentals

Codefresh / CNCF

  • GitOps principles
  • ArgoCD/Flux basics
  • Progressive delivery
  • Free certification
Foundational Free

📅 Recommended Study Plan - 12 Weeks

Weeks 1-2

Module 5.1: High Availability Foundations

SLA/SLO/SLI, redundancy patterns, availability calculations. Read chapters 2-4 of the SRE Book.
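
As a rough illustration of the availability calculations mentioned above, here is a minimal Python sketch. It assumes component failures are independent, which real systems rarely guarantee, and every function name and number in it is made up for the example.

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def serial_availability(availabilities):
    """Availability of a chain where every component must be up."""
    return math.prod(availabilities)


def parallel_availability(availabilities):
    """Availability of redundant components where any one being up is enough.

    Assumes independent failures (an idealization).
    """
    return 1.0 - math.prod(1.0 - a for a in availabilities)


def allowed_downtime_minutes(availability, period_minutes=MINUTES_PER_YEAR):
    """Downtime implied by an availability target over a period."""
    return (1.0 - availability) * period_minutes


if __name__ == "__main__":
    # Hypothetical setup: two 99% replicas behind a 99.9% load balancer.
    replicas = parallel_availability([0.99, 0.99])
    system = serial_availability([0.999, replicas])
    print(f"replica pair: {replicas:.4%}")
    print(f"whole system: {system:.4%}")
    print(f"yearly downtime at 99.9%: {allowed_downtime_minutes(0.999):.1f} min")
```

The sketch shows the two basic rules: redundancy multiplies the probabilities of being down, while dependencies in series multiply the probabilities of being up, so the chain is never more available than its weakest link.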

Weeks 3-4

Module 5.1: Disaster Recovery

DR strategies, RTO/RPO, backup strategies. Lab: implement a DR plan with Velero on K8s.
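
To make the RPO idea concrete, the following Python sketch checks whether a backup schedule would have kept data loss within a target. It is only an illustration: the timestamps are hypothetical, the function names are invented for this example, and in a real lab the timestamps would come from whatever your backup tool reports.

```python
from datetime import datetime, timedelta


def worst_case_data_loss(backup_times, now):
    """Longest stretch without a backup, including the time since the latest one."""
    if not backup_times:
        raise ValueError("no backups recorded")
    points = sorted(backup_times) + [now]
    return max(later - earlier for earlier, later in zip(points, points[1:]))


def meets_rpo(backup_times, now, rpo):
    """True when no gap between backups exceeds the recovery point objective."""
    return worst_case_data_loss(backup_times, now) <= rpo


if __name__ == "__main__":
    # Hypothetical backups at midnight and 06:00, checked at noon.
    now = datetime(2024, 1, 1, 12, 0)
    backups = [datetime(2024, 1, 1, 0, 0), datetime(2024, 1, 1, 6, 0)]
    print("worst-case loss:", worst_case_data_loss(backups, now))
    print("meets 4h RPO:", meets_rpo(backups, now, timedelta(hours=4)))
```

RTO is a separate measurement: it is the time a restore actually takes, which only a timed restore drill can tell you.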

Weeks 5-6

Module 5.1: Chaos Engineering

Chaos engineering principles, first game days. Lab: install LitmusChaos and run experiments.

Weeks 7-8

Module 5.2: Performance Testing

Performance metrics, load testing with k6, interpreting results. Lab: profile an application.

Weeks 9-10

Module 5.3: SRE Practices

Error budgets, toil, incident management, post-mortems. Read the hands-on chapters of the SRE Workbook.
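
For readers new to error budgets, a minimal Python sketch of the arithmetic may help. It assumes a request-based SLO (good requests divided by total requests); the function names and figures are illustrative only.

```python
def error_budget(slo_target, total_requests):
    """Number of requests allowed to fail while still meeting the SLO."""
    return (1.0 - slo_target) * total_requests


def budget_remaining(slo_target, total_requests, failed_requests):
    """Fraction of the error budget still unspent (negative once overspent)."""
    budget = error_budget(slo_target, total_requests)
    if budget == 0:
        raise ValueError("a 100% SLO leaves no error budget")
    return 1.0 - failed_requests / budget


if __name__ == "__main__":
    # Hypothetical month: 1,000,000 requests against a 99.9% SLO, 250 failures.
    print(f"budget: {error_budget(0.999, 1_000_000):.0f} failed requests")
    print(f"remaining: {budget_remaining(0.999, 1_000_000, 250):.0%}")
```

A team would typically use the remaining fraction to decide whether to keep shipping features or to prioritize reliability work.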

Weeks 11-12

Module 5.3: GitOps & Platform Engineering

ArgoCD, Flux, internal developer platforms. Lab: deploy a complete app via GitOps with ArgoCD.

👥 Communities & Forums

📝 Must-Read Blogs & Articles

Google SRE Blog

Free

Official articles from the Google SRE team. Case studies, new practices, and lessons learned from some of the largest systems.

Visit →

Netflix Tech Blog

Free

Chaos engineering, resilience patterns, large-scale architecture. Pioneers of chaos engineering.

Visit →

AWS Architecture Blog

Free

HA patterns, multi-region, disaster recovery on AWS. Well-documented reference architectures.

Visit →

Gremlin Blog

Free

Practical chaos engineering, game day guides, failure injection patterns. Very instructive.

Visit →

Platform Engineering

Free

Platform engineering community. Best practices, tools, and case studies of Internal Developer Platforms.

Visit →

DORA Research

Free

State of DevOps reports, DORA metrics research. Research-backed data on DevOps performance.

Visit →

🎤 Conferences

SREcon

Videos

USENIX conference dedicated to SRE. Advanced talks, workshops, and networking with leading SREs from around the world.

Watch the talks →

KubeCon + CloudNativeCon

Videos

The largest cloud native conference. Kubernetes, observability, GitOps, service mesh.

YouTube CNCF →

Chaos Conf

Videos

Conference dedicated to chaos engineering, run by Gremlin. Experiments, patterns, game days.

Watch the talks →

PlatformCon

Videos Free

100% virtual conference on platform engineering. Talks, demos, case studies.

Visit →

🛠️ Suggested Hands-On Projects

Project 1: SLO Dashboard

Prometheus + Grafana

Build a complete dashboard with SLIs, SLOs, and error budget burn rate. Implement budget-based alerts.
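
Before building the dashboard, it can help to work through the burn-rate arithmetic by hand. The Python sketch below is one possible formulation, not a prescribed implementation. It assumes a 30-day SLO period and a paging threshold of 14.4 (the burn rate that spends 2% of a 30-day budget in one hour); treat both as tunable assumptions. The function names are invented for the example.

```python
def burn_rate(error_ratio, slo_target):
    """How fast the error budget is being consumed; 1.0 means exactly on budget."""
    return error_ratio / (1.0 - slo_target)


def budget_consumed(burn, window_hours, slo_period_hours=30 * 24):
    """Fraction of the period's budget spent if `burn` holds for the whole window."""
    return burn * window_hours / slo_period_hours


def should_page(long_window_error_ratio, short_window_error_ratio,
                slo_target, threshold=14.4):
    """Page only when both windows burn faster than the threshold.

    The short window lets the alert clear quickly once the problem is fixed.
    """
    return (burn_rate(long_window_error_ratio, slo_target) >= threshold
            and burn_rate(short_window_error_ratio, slo_target) >= threshold)


if __name__ == "__main__":
    # Hypothetical: 2% of requests failing against a 99.9% SLO.
    rate = burn_rate(0.02, 0.999)
    print(f"burn rate: {rate:.1f}x")
    print(f"budget spent in 1h: {budget_consumed(rate, 1):.1%}")
    print("page:", should_page(0.02, 0.02, 0.999))
```

In a Prometheus and Grafana setup, the two error ratios would come from queries over a long and a short window; this sketch only covers the decision logic that sits on top of them.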

Project 2: Chaos Game Day

LitmusChaos

Run a complete game day: hypotheses, experiments, observations, post-mortem. Document the findings.

Project 3: DR Automation

Velero + Terraform

Automate a complete DR plan: backup, restore, failover. Test regularly with drills.

Project 4: GitOps Pipeline

ArgoCD + Kustomize

Complete GitOps pipeline with environments (dev/staging/prod), progressive delivery, and automatic rollback.

Project 5: Load Testing Suite

k6 + GitHub Actions

Automated performance test suite in CI/CD. Regression testing, baseline comparison, reporting.
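
The baseline comparison step can be sketched independently of any load testing tool. The Python example below assumes you have exported raw latency samples in milliseconds from a test run. The nearest-rank percentile, the 10% tolerance, and the function names are illustrative choices, not part of k6.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def regressed(baseline_ms, current_ms, tolerance=0.10):
    """True when the current value is more than `tolerance` worse than the baseline."""
    return current_ms > baseline_ms * (1.0 + tolerance)


if __name__ == "__main__":
    # Hypothetical latency samples (ms) from a stored baseline and the current run.
    baseline = [120, 130, 125, 140, 200, 135, 128, 132, 138, 150]
    current = [125, 135, 130, 190, 260, 140, 133, 137, 145, 170]
    b95, c95 = percentile(baseline, 95), percentile(current, 95)
    print(f"p95 baseline={b95}ms current={c95}ms regressed={regressed(b95, c95)}")
```

In CI, a script like this would exit with a non-zero status when `regressed` returns True, so that the pipeline fails on a performance regression.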

Project 6: Mini IDP

Backstage

Basic Internal Developer Platform with a service catalog, templates, and documentation. Golden paths for devs.