Google AI Studio vs Vertex AI

⏱ 20 min · Intermediate

🎯 Learning objectives

  • Understand the differences between Google AI Studio and Vertex AI
  • Know how to choose the right platform for each use case
  • Plan a migration from AI Studio to Vertex AI
  • Master the enterprise decision criteria

📊 Full Comparison

| Criterion | Google AI Studio | Vertex AI |
|---|---|---|
| Target audience | Developers, rapid prototyping | Enterprises, production |
| Access | Free Google account | GCP project with billing |
| API key | Simple API key (generativelanguage.googleapis.com) | Service Account, ADC (PROJECT-aiplatform.googleapis.com) |
| Quotas | Default limits (60 req/min) | Customizable, raisable quotas |
| Security | Shareable API key | IAM, VPC-SC, CMEK, Private Service Connect |
| Compliance | No guarantees | SOC 2, ISO 27001, HIPAA, GDPR |
| Data residency | Multi-region (EU or US) | Specific chosen region |
| Monitoring | Basic console view | Cloud Monitoring, logging, tracing, SLIs |
| Caching | Context Caching available | Context Caching + enterprise optimizations |
| Pricing | Same rates as Vertex AI | Same rates + enterprise options |
| SLA | None | 99.9% uptime (GA models) |

Google AI Studio is perfect for quick prototyping, prompt testing, and building demos. But as soon as you go to production with sensitive data or compliance requirements, Vertex AI becomes indispensable.
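As a rough illustration of the comparison above, here is a minimal sketch of the decision logic. This is a hypothetical helper (the function and parameter names are ours, not part of any Google SDK): any single enterprise requirement tips the choice to Vertex AI.

```python
def choose_platform(sensitive_data=False, needs_compliance=False,
                    needs_sla=False, high_quota=False):
    """Toy decision helper mirroring the comparison table.

    Returns "Vertex AI" as soon as any enterprise requirement appears,
    otherwise "AI Studio" for prototyping.
    """
    if sensitive_data or needs_compliance or needs_sla or high_quota:
        return "Vertex AI"
    return "AI Studio"

print(choose_platform())                       # prototyping phase
print(choose_platform(needs_compliance=True))  # GDPR/HIPAA in play
```

The point is less the trivial `or` than the shape of the question: the criteria are cumulative, and a single "yes" is enough to rule out AI Studio for production.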

๐Ÿ— Architecture de Decision

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚                 DECISION : AI Studio vs Vertex AI        โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                            โ”‚
                            โ–ผ
                  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
                  โ”‚  Cas d'usage ?      โ”‚
                  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                            โ”‚
         โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
         โ–ผ                  โ–ผ                  โ–ผ
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ PROTOTYPAGE     โ”‚ โ”‚ PRODUCTION      โ”‚ โ”‚ ENTREPRISE      โ”‚
โ”‚                 โ”‚ โ”‚ SIMPLE          โ”‚ โ”‚ CRITIQUE        โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
         โ”‚                  โ”‚                  โ”‚
         โ–ผ                  โ–ผ                  โ–ผ
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ AI STUDIO โœ“     โ”‚ โ”‚ Vertex AI       โ”‚ โ”‚ Vertex AI โœ“     โ”‚
โ”‚                 โ”‚ โ”‚ (recommande)    โ”‚ โ”‚ + VPC-SC        โ”‚
โ”‚ - Rapide        โ”‚ โ”‚                 โ”‚ โ”‚ + CMEK          โ”‚
โ”‚ - Gratuit       โ”‚ โ”‚ - Monitoring    โ”‚ โ”‚ - Compliance    โ”‚
โ”‚ - Experimentationโ”‚ โ”‚ - Quotas        โ”‚ โ”‚ - Data residencyโ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

🔄 Migration: AI Studio → Vertex AI

Step 1: Create a GCP project

bash
# 1. Create the GCP project
gcloud projects create mon-projet-gemini --name="Gemini Production"

# 2. Enable the APIs
gcloud services enable aiplatform.googleapis.com
gcloud services enable cloudresourcemanager.googleapis.com

# 3. Create a Service Account
gcloud iam service-accounts create gemini-sa \
  --display-name="Gemini Service Account"

# 4. Grant permissions
gcloud projects add-iam-policy-binding mon-projet-gemini \
  --member="serviceAccount:gemini-sa@mon-projet-gemini.iam.gserviceaccount.com" \
  --role="roles/aiplatform.user"

Step 2: Adapt the code

python
# BEFORE: AI Studio
import google.generativeai as genai

genai.configure(api_key="AIzaSy...")
model = genai.GenerativeModel('gemini-2.0-flash-exp')
response = model.generate_content("Hello")

# AFTER: Vertex AI
from vertexai.generative_models import GenerativeModel
import vertexai

vertexai.init(project="mon-projet-gemini", location="us-central1")
model = GenerativeModel('gemini-2.0-flash-exp')
response = model.generate_content("Hello")

💡 Frictionless migration
The Vertex AI API is nearly identical to the AI Studio one. Only the initialization changes; prompts, parameters and responses stay the same.
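Since only the initialization differs, a common pattern is to select the backend from environment variables. Here is a minimal sketch; the `USE_VERTEX`, `GCP_PROJECT`, `GCP_REGION` and `GEMINI_API_KEY` variable names are illustrative conventions, and the function only builds the parameters rather than calling either SDK, so the selection logic stays testable on its own:

```python
import os

def backend_config(env=os.environ):
    """Decide which SDK to initialize, and with what parameters.

    The actual genai.configure(...) or vertexai.init(...) call would
    consume this dict; it is kept separate here on purpose.
    """
    if env.get("USE_VERTEX") == "1":
        return {
            "backend": "vertex",
            "project": env["GCP_PROJECT"],
            "location": env.get("GCP_REGION", "us-central1"),
        }
    return {"backend": "ai_studio", "api_key": env["GEMINI_API_KEY"]}

cfg = backend_config({"USE_VERTEX": "1", "GCP_PROJECT": "mon-projet-gemini"})
print(cfg["backend"])  # vertex
```

Flipping one environment variable then moves an app from prototype (AI Studio) to production (Vertex AI) without touching call sites.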

โš–๏ธ Criteres de Decision Enterprise

Utilisez Vertex AI si :

  • โœ… Vous traitez des donnees clients sensibles (PII, PHI)
  • โœ… Vous avez besoin de conformite (GDPR, HIPAA, SOC2)
  • โœ… Vous voulez controler la region de traitement des donnees
  • โœ… Vous avez besoin de quotas eleves (>60 req/min)
  • โœ… Vous voulez un SLA avec uptime 99.9%
  • โœ… Vous devez integrer avec VPC, Private Service Connect
  • โœ… Vous avez besoin d'audit logs detailles
  • โœ… Vous voulez du monitoring avance (Cloud Monitoring)

Utilisez AI Studio si :

  • โœ… Vous etes en phase de prototypage/experimentation
  • โœ… Vous n'avez pas de donnees sensibles
  • โœ… Vous voulez tester rapidement sans setup GCP
  • โœ… Vous explorez les capacites de Gemini
  • โœ… Vous creez une demo ou un hackathon
โš ๏ธ Attention aux API Keys
Les API keys de AI Studio ne doivent JAMAIS etre commitees dans Git ni exposees cote client. Pour la production, utilisez toujours Vertex AI avec Service Accounts et Application Default Credentials (ADC).
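For the prototyping phase where an AI Studio key is still in use, a minimal safeguard is to load it from the environment and fail fast when it is missing. A sketch (the `GEMINI_API_KEY` variable name is a common convention, not a requirement of the SDK):

```python
import os

def load_api_key(env=os.environ):
    """Read the AI Studio key from the environment, never from source code."""
    key = env.get("GEMINI_API_KEY")
    if not key:
        raise RuntimeError(
            "GEMINI_API_KEY is not set; export it or use a secret manager."
        )
    return key

# genai.configure(api_key=load_api_key()) would then consume it.
```

Failing at startup with an explicit message beats a hardcoded key that silently ends up in a Git history.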

Vertex AI: Enterprise Setup

⏱ 25 min · Advanced

🎯 Learning objectives

  • Configure a production-ready Vertex AI project
  • Master IAM, VPC-SC and CMEK for security
  • Configure Private Service Connect for isolation
  • Manage quotas and limits

๐Ÿ— Architecture Vertex AI Enterprise

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚                    VERTEX AI ENTERPRISE                     โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                            โ”‚
         โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
         โ–ผ                  โ–ผ                  โ–ผ
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚   SECURITE      โ”‚ โ”‚  NETWORKING     โ”‚ โ”‚  MONITORING     โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
โ”‚                   โ”‚                   โ”‚
โ”‚ - IAM Policies    โ”‚ - VPC-SC          โ”‚ - Cloud Logging
โ”‚ - Service Account โ”‚ - Private Connect โ”‚ - Cloud Monitoring
โ”‚ - CMEK            โ”‚ - Shared VPC      โ”‚ - Audit Logs
โ”‚ - Workload ID     โ”‚ - Cloud NAT       โ”‚ - Cost Dashboard
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

๐Ÿ” Configuration IAM

Roles principaux :

Role Permissions Cas d'usage
roles/aiplatform.user Utiliser Gemini, lire modeles Applications backend, services
roles/aiplatform.admin Gerer endpoints, datasets Admins ML, DevOps
roles/aiplatform.viewer Lire ressources (read-only) Monitoring, audit
roles/serviceusage.serviceUsageConsumer Consommer APIs Toutes applications
bash
# Full IAM configuration
PROJECT_ID="mon-projet-prod"
SA_NAME="gemini-backend-sa"
SA_EMAIL="${SA_NAME}@${PROJECT_ID}.iam.gserviceaccount.com"

# 1. Create the Service Account
gcloud iam service-accounts create $SA_NAME \
  --display-name="Gemini Backend Service" \
  --description="SA for production Gemini API calls"

# 2. Grant minimal permissions (principle of least privilege)
gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member="serviceAccount:$SA_EMAIL" \
  --role="roles/aiplatform.user"

gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member="serviceAccount:$SA_EMAIL" \
  --role="roles/serviceusage.serviceUsageConsumer"

# 3. (Optional) Workload Identity for GKE
gcloud iam service-accounts add-iam-policy-binding $SA_EMAIL \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:${PROJECT_ID}.svc.id.goog[NAMESPACE/KSA_NAME]"

# 4. Generate a key (only if needed; prefer ADC)
gcloud iam service-accounts keys create key.json \
  --iam-account=$SA_EMAIL

🛡 VPC Service Controls (VPC-SC)

VPC-SC creates a security perimeter that protects data against exfiltration.

bash
# 1. Create the Access Policy
gcloud access-context-manager policies create \
  --organization ORG_ID \
  --title "Gemini Production Policy"

# 2. Create the Service Perimeter
gcloud access-context-manager perimeters create gemini_perimeter \
  --title="Gemini Secure Perimeter" \
  --resources=projects/PROJECT_NUMBER \
  --restricted-services=aiplatform.googleapis.com \
  --policy=POLICY_ID

# 3. Allow Private Service Connect
gcloud access-context-manager perimeters update gemini_perimeter \
  --add-vpc-allowed-services=aiplatform.googleapis.com \
  --policy=POLICY_ID

💡 VPC-SC in practice
VPC-SC blocks Vertex AI calls from outside the perimeter, which is ideal for HIPAA or financial data. Be careful, though: it can also block your local developers. Use Access Levels to allow specific IPs.

🔒 CMEK (Customer-Managed Encryption Keys)

By default, Google encrypts all data. CMEK gives you control over the encryption keys.

bash
# 1. Create a Key Ring and Key in Cloud KMS
gcloud kms keyrings create gemini-keyring \
  --location=us-central1

gcloud kms keys create gemini-key \
  --location=us-central1 \
  --keyring=gemini-keyring \
  --purpose=encryption

# 2. Grant access to the Vertex AI service agent
PROJECT_NUMBER=$(gcloud projects describe $PROJECT_ID --format="value(projectNumber)")
SA_VERTEX="service-${PROJECT_NUMBER}@gcp-sa-aiplatform.iam.gserviceaccount.com"

gcloud kms keys add-iam-policy-binding gemini-key \
  --location=us-central1 \
  --keyring=gemini-keyring \
  --member="serviceAccount:$SA_VERTEX" \
  --role="roles/cloudkms.cryptoKeyEncrypterDecrypter"

# 3. Use CMEK in Vertex AI (via console or API)
# When creating an endpoint or dataset, specify:
# encryption_spec_key_name = "projects/PROJECT/locations/LOCATION/keyRings/RING/cryptoKeys/KEY"

๐ŸŒ Private Service Connect

Private Service Connect permet d'appeler Vertex AI depuis votre VPC sans passer par Internet.

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚                      VPC (10.0.0.0/16)                    โ”‚
โ”‚  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”         โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”         โ”‚
โ”‚  โ”‚  GKE Cluster    โ”‚โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–ถโ”‚ Private Service โ”‚         โ”‚
โ”‚  โ”‚  (10.0.1.0/24)  โ”‚         โ”‚ Connect Endpointโ”‚         โ”‚
โ”‚  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜         โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜         โ”‚
โ”‚                                       โ”‚                   โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                                        โ”‚ (Traffic prive)
                                        โ–ผ
                        โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
                        โ”‚   Vertex AI Service       โ”‚
                        โ”‚ (aiplatform.googleapis.com)โ”‚
                        โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
bash
# Configuration Private Service Connect pour Vertex AI
gcloud compute addresses create vertex-ai-psc \
  --region=us-central1 \
  --subnet=default \
  --addresses=10.0.2.10

gcloud compute forwarding-rules create vertex-ai-psc-rule \
  --region=us-central1 \
  --network=default \
  --address=vertex-ai-psc \
  --target-service-attachment=projects/PROJECT/regions/us-central1/serviceAttachments/aiplatform

📊 Quota Management

Default Vertex AI quotas:

  • Gemini Pro: 60 req/min, 4,000 req/day
  • Gemini Flash: 1,000 req/min, 10,000 req/day
  • Gemini Flash-Lite: 1,500 req/min, 15,000 req/day
  • Max tokens: 2M tokens/min (input + output combined)

bash
# Check current quotas
gcloud services quota list \
  --service=aiplatform.googleapis.com \
  --consumer=projects/$PROJECT_ID

# Request a quota increase (via console or support)
# GCP Console > IAM & Admin > Quotas > filter "aiplatform" > Edit

For production, anticipate your quota needs: an increase request can take 2-3 business days. Put application-side rate limiting and fallbacks in place so quota overruns are handled gracefully.
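Application-side rate limiting can be sketched as a token bucket. This is a generic pattern, not a Vertex AI API; the 60 req/min figure simply mirrors the default Gemini Pro quota above, and the injectable `clock` exists only to make the limiter testable:

```python
import time

class TokenBucket:
    """Simple token-bucket rate limiter for client-side quota protection."""

    def __init__(self, rate_per_min=60, capacity=60, clock=time.monotonic):
        self.rate = rate_per_min / 60.0   # tokens refilled per second
        self.capacity = capacity
        self.tokens = float(capacity)
        self.clock = clock
        self.last = clock()

    def allow(self):
        """Consume one token if available; False means back off or fall back."""
        now = self.clock()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate_per_min=60, capacity=2)
print(bucket.allow(), bucket.allow(), bucket.allow())  # True True False
```

When `allow()` returns False, the caller can queue the request, return a cached answer, or degrade to a cheaper model rather than hit a hard 429 from the API.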

Infrastructure & Scalability

⏱ 30 min · Advanced

🎯 Learning objectives

  • Deploy Gemini on Cloud Run, GKE, and Cloud Functions
  • Configure auto-scaling and load balancing
  • Optimize latency with CDN and caching
  • Design a highly available architecture

☁️ Deployment Options

| Solution | Use case | Advantages | Limits |
|---|---|---|---|
| Cloud Run | Serverless APIs, microservices | Auto-scaling, pay-per-use, simple | Cold starts (~1-2 s) |
| GKE (Kubernetes) | Complex workloads, full control | Flexible, multi-cloud, fine-grained scaling | Complexity, overhead |
| Cloud Functions | Event-driven, simple webhooks | Very simple, native integrations | 60 min timeout, cold starts |
| Compute Engine | Legacy apps, VM-level control | Full control, legacy-compatible | No built-in auto-scaling |

🚀 Cloud Run Deployment (Recommended)

python
# main.py - Gemini service on Cloud Run
from flask import Flask, request, jsonify
from vertexai.generative_models import GenerativeModel
import vertexai
import os

app = Flask(__name__)

# Init Vertex AI (uses ADC automatically on Cloud Run)
vertexai.init(
    project=os.environ.get("GCP_PROJECT"),
    location=os.environ.get("GCP_REGION", "us-central1")
)

model = GenerativeModel("gemini-2.0-flash-exp")

@app.route("/generate", methods=["POST"])
def generate():
    try:
        data = request.get_json()
        prompt = data.get("prompt")

        # Call Gemini with basic generation settings
        response = model.generate_content(
            prompt,
            generation_config={
                "temperature": 0.7,
                "max_output_tokens": 2048
            }
        )

        return jsonify({
            "text": response.text,
            "usage": {
                "prompt_tokens": response.usage_metadata.prompt_token_count,
                "candidates_tokens": response.usage_metadata.candidates_token_count
            }
        })
    except Exception as e:
        return jsonify({"error": str(e)}), 500

@app.route("/health", methods=["GET"])
def health():
    return jsonify({"status": "healthy"}), 200

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=int(os.environ.get("PORT", 8080)))
dockerfile
# Dockerfile
FROM python:3.12-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY main.py .

# Container healthcheck (note: Cloud Run ignores Docker HEALTHCHECK; useful for local runs)
HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
  CMD python -c "import requests; requests.get('http://localhost:8080/health')"

CMD ["gunicorn", "--bind", "0.0.0.0:8080", "--workers", "4", "--threads", "2", "--timeout", "300", "main:app"]
bash
# Cloud Run deployment with optimizations
gcloud run deploy gemini-api \
  --source . \
  --region us-central1 \
  --platform managed \
  --allow-unauthenticated \
  --service-account gemini-backend-sa@PROJECT_ID.iam.gserviceaccount.com \
  --set-env-vars GCP_PROJECT=PROJECT_ID,GCP_REGION=us-central1 \
  --memory 2Gi \
  --cpu 2 \
  --min-instances 1 \
  --max-instances 100 \
  --concurrency 80 \
  --timeout 300 \
  --cpu-boost \
  --execution-environment gen2

# Auto-scaling configuration
gcloud run services update gemini-api \
  --region us-central1 \
  --cpu-throttling \
  --max-instances 100 \
  --min-instances 2

💡 Cloud Run optimization
  • --min-instances 2: eliminates cold starts for the vast majority of requests
  • --cpu-boost: speeds up instance startup (roughly 30% faster)
  • --concurrency 80: balances throughput against latency
  • --execution-environment gen2: faster startup and better isolation

⚓ GKE Deployment (Kubernetes)

yaml
# k8s-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gemini-api
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: gemini-api
  template:
    metadata:
      labels:
        app: gemini-api
    spec:
      serviceAccountName: gemini-k8s-sa
      containers:
      - name: gemini-api
        image: gcr.io/PROJECT_ID/gemini-api:v1.0.0
        ports:
        - containerPort: 8080
        env:
        - name: GCP_PROJECT
          value: "PROJECT_ID"
        - name: GCP_REGION
          value: "us-central1"
        resources:
          requests:
            memory: "1Gi"
            cpu: "1000m"
          limits:
            memory: "2Gi"
            cpu: "2000m"
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 30
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 10
---
apiVersion: v1
kind: Service
metadata:
  name: gemini-api-service
  namespace: production
spec:
  type: LoadBalancer
  selector:
    app: gemini-api
  ports:
  - protocol: TCP
    port: 80
    targetPort: 8080
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: gemini-api-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: gemini-api
  minReplicas: 3
  maxReplicas: 50
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80

๐ŸŒ Load Balancing & CDN

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚                    GLOBAL LOAD BALANCER                    โ”‚
โ”‚                  (Cloud Load Balancing)                    โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                            โ”‚
         โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
         โ–ผ                  โ–ผ                  โ–ผ
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚  us-central1    โ”‚ โ”‚   europe-west1  โ”‚ โ”‚  asia-east1     โ”‚
โ”‚  Cloud Run      โ”‚ โ”‚   Cloud Run     โ”‚ โ”‚  Cloud Run      โ”‚
โ”‚  (3 instances)  โ”‚ โ”‚   (3 instances) โ”‚ โ”‚  (3 instances)  โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
         โ”‚                  โ”‚                  โ”‚
         โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                            โ–ผ
                    โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
                    โ”‚  Vertex AI    โ”‚
                    โ”‚ (us-central1) โ”‚
                    โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
bash
# Global Load Balancer configuration with Cloud CDN
# 1. Create a serverless NEG (Network Endpoint Group) for Cloud Run
gcloud compute network-endpoint-groups create gemini-api-neg \
  --region=us-central1 \
  --network-endpoint-type=serverless \
  --cloud-run-service=gemini-api

# 2. Create the Backend Service with CDN
gcloud compute backend-services create gemini-backend \
  --global \
  --enable-cdn \
  --cache-mode=CACHE_ALL_STATIC \
  --default-ttl=3600

# 3. Add the NEG to the backend
gcloud compute backend-services add-backend gemini-backend \
  --global \
  --network-endpoint-group=gemini-api-neg \
  --network-endpoint-group-region=us-central1

# 4. Create the URL map and HTTPS proxy
gcloud compute url-maps create gemini-lb \
  --default-service=gemini-backend

gcloud compute target-https-proxies create gemini-https-proxy \
  --url-map=gemini-lb \
  --ssl-certificates=gemini-cert

# 5. Create a global IP and forwarding rule
gcloud compute addresses create gemini-ip --global

gcloud compute forwarding-rules create gemini-https-rule \
  --global \
  --target-https-proxy=gemini-https-proxy \
  --address=gemini-ip \
  --ports=443

For optimal latency: deploy Cloud Run in several regions (us-central1, europe-west1, asia-east1), configure a global Load Balancer, and enable Cloud CDN to cache frequent responses. Vertex AI is only available in certain regions, so your backends should call the closest Vertex AI region.
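The "call the closest Vertex AI region" advice can be sketched as a simple lookup each backend performs at startup. The mapping below is illustrative only, not an exhaustive or authoritative list of Vertex AI regions:

```python
# Illustrative mapping from backend region to a nearby Vertex AI region.
# Assumption: asia-east1 routes to asia-northeast1 as a nearby choice.
NEAREST_VERTEX_REGION = {
    "us-central1": "us-central1",
    "europe-west1": "europe-west1",
    "asia-east1": "asia-northeast1",
}

def vertex_region_for(backend_region, default="us-central1"):
    """Pick the Vertex AI region a regional backend should call."""
    return NEAREST_VERTEX_REGION.get(backend_region, default)

print(vertex_region_for("europe-west1"))    # europe-west1
print(vertex_region_for("unknown-region"))  # us-central1 (fallback)
```

Each Cloud Run instance would pass the result to `vertexai.init(location=...)`, keeping model calls within the same geography as the user-facing backend.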

⚡ Performance Optimizations

| Technique | Latency impact | Implementation |
|---|---|---|
| Min instances > 0 | -1000 ms (eliminates cold start) | --min-instances 2 on Cloud Run |
| Connection pooling | -50 ms per request | Reuse the Vertex AI client |
| Streaming | -2000 ms (TTFT) | stream=True in generate_content |
| Context caching | -80% latency | Cache long system prompts |
| CDN for assets | -200 ms (static assets) | Cloud CDN on the Load Balancer |
| Multiple regions | -100 ms (geo latency) | Multi-region deploy + global LB |

Enterprise Security

⏱ 25 min · Advanced

🎯 Learning objectives

  • Implement a zero-trust IAM strategy
  • Configure VPC Service Controls and DLP
  • Secure secrets with Secret Manager
  • Enable audit logs and security monitoring

🛡 Defense in Depth

┌─────────────────────────────────────────────────────────────┐
│                    LAYER 7: MONITORING                      │
│         Audit Logs, Security Command Center, Alerting       │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│                    LAYER 6: DLP & FILTERING                 │
│              Data Loss Prevention, Content Moderation       │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│                    LAYER 5: ENCRYPTION                      │
│                 CMEK, TLS 1.3, Data at Rest                 │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│                    LAYER 4: SECRETS                         │
│               Secret Manager, Workload Identity             │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│                    LAYER 3: NETWORK ISOLATION               │
│              VPC-SC, Private Service Connect                │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│                    LAYER 2: IDENTITY                        │
│               IAM Policies, Service Accounts                │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│                    LAYER 1: AUTHENTICATION                  │
│              OAuth 2.0, API Keys, mTLS                      │
└─────────────────────────────────────────────────────────────┘

๐Ÿ” IAM Zero-Trust

Principe de moindre privilege (Least Privilege) :

bash
# MAUVAIS : Donner roles/owner (trop de permissions)
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="serviceAccount:app@PROJECT_ID.iam.gserviceaccount.com" \
  --role="roles/owner"  # โŒ DANGER

# BON : Permissions granulaires minimales
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="serviceAccount:app@PROJECT_ID.iam.gserviceaccount.com" \
  --role="roles/aiplatform.user"  # โœ… Minimum necessaire

# Encore mieux : Custom Role avec permissions precises
gcloud iam roles create geminiUserCustom \
  --project=PROJECT_ID \
  --title="Gemini User Custom" \
  --permissions=aiplatform.endpoints.predict,aiplatform.models.get

Segregation par environnement :

bash
# Service Accounts separes par environnement
# DEV
gcloud iam service-accounts create gemini-dev-sa \
  --display-name="Gemini Dev" \
  --project=project-dev

# STAGING
gcloud iam service-accounts create gemini-staging-sa \
  --display-name="Gemini Staging" \
  --project=project-staging

# PROD
gcloud iam service-accounts create gemini-prod-sa \
  --display-name="Gemini Prod" \
  --project=project-prod

# IAM Conditions : Limiter acces par IP, heure, ressource
gcloud projects add-iam-policy-binding project-prod \
  --member="serviceAccount:gemini-prod-sa@project-prod.iam.gserviceaccount.com" \
  --role="roles/aiplatform.user" \
  --condition='expression=request.time < timestamp("2026-12-31T23:59:59Z"),title=expires-end-of-year'

🔒 Secret Manager

bash
# 1. Create a secret for third-party API keys
echo -n "sk-openai-api-key-xyz" | gcloud secrets create openai-api-key \
  --data-file=- \
  --replication-policy="automatic"

# 2. Grant access to the Service Account
gcloud secrets add-iam-policy-binding openai-api-key \
  --member="serviceAccount:gemini-prod-sa@project-prod.iam.gserviceaccount.com" \
  --role="roles/secretmanager.secretAccessor"

python
# 3. Use the secret in the application
from google.cloud import secretmanager

client = secretmanager.SecretManagerServiceClient()
name = "projects/PROJECT_ID/secrets/openai-api-key/versions/latest"
response = client.access_secret_version(request={"name": name})
api_key = response.payload.data.decode("UTF-8")

⚠️ Secrets: what you must NEVER do
  • ❌ Hardcode secrets in source code
  • ❌ Commit .env files with real keys to Git
  • ❌ Expose secrets in logs or error messages
  • ❌ Share secrets via Slack/email
  • ❌ Use the same API key for dev/staging/prod

🚨 Data Loss Prevention (DLP)

DLP automatically detects and masks sensitive data (PII, PHI, PCI) before it is sent to Gemini.

python
# DLP inspection before sending to Gemini
from google.cloud import dlp_v2

def inspect_and_deidentify(text, project_id):
    dlp = dlp_v2.DlpServiceClient()
    parent = f"projects/{project_id}/locations/global"

    # Inspection configuration (detect PII)
    inspect_config = {
        "info_types": [
            {"name": "EMAIL_ADDRESS"},
            {"name": "PHONE_NUMBER"},
            {"name": "CREDIT_CARD_NUMBER"},
            {"name": "US_SOCIAL_SECURITY_NUMBER"},
            {"name": "PERSON_NAME"}
        ],
        "min_likelihood": dlp_v2.Likelihood.LIKELY
    }

    # De-identification configuration (mask PII)
    deidentify_config = {
        "info_type_transformations": {
            "transformations": [{
                "primitive_transformation": {
                    "replace_with_info_type_config": {}
                }
            }]
        }
    }

    item = {"value": text}
    response = dlp.deidentify_content(
        request={
            "parent": parent,
            "deidentify_config": deidentify_config,
            "inspect_config": inspect_config,
            "item": item
        }
    )

    return response.item.value

# Usage
user_input = "My email is john.doe@example.com and my phone is 555-1234"
safe_input = inspect_and_deidentify(user_input, "mon-projet")
# Result: "My email is [EMAIL_ADDRESS] and my phone is [PHONE_NUMBER]"

# Send Gemini only the de-identified text
response = model.generate_content(safe_input)

๐Ÿ“ Audit Logs

3 types d'Audit Logs :

  • Admin Activity : Actions d'administration (toujours active, gratuit)
  • Data Access : Lectures/ecritures de donnees (doit etre active, payant)
  • System Event : Evenements GCP internes (gratuit)
bash
# Activer Data Access Logs pour Vertex AI
gcloud logging project-logs enable \
  DATA_ACCESS \
  --project=PROJECT_ID

# Requete logs : Qui a appele Gemini ?
gcloud logging read 'resource.type="aiplatform.googleapis.com/Endpoint"
  AND protoPayload.methodName="google.cloud.aiplatform.v1.PredictionService.Predict"' \
  --project=PROJECT_ID \
  --limit=50 \
  --format=json

# Creer alertes sur comportements suspects
gcloud logging metrics create gemini_unusual_volume \
  --description="Alert si >1000 req/min depuis une meme IP" \
  --log-filter='resource.type="aiplatform.googleapis.com/Endpoint"
    AND protoPayload.methodName="google.cloud.aiplatform.v1.PredictionService.Predict"'

Les audit logs sont essentiels pour la conformite (GDPR Article 32, SOC2, HIPAA). Activez-les en production. Configurez des exports vers BigQuery pour analyse long-terme et correlation avec Security Command Center pour detection d'anomalies automatique.

๐ŸŒ VPC Service Controls (VPC-SC) Avance

yaml
# vpc-sc-policy.yaml - Complete VPC-SC configuration
name: accessPolicies/POLICY_ID/servicePerimeters/gemini_perimeter
title: "Gemini Production Secure Perimeter"
status:
  resources:
    - projects/PROJECT_NUMBER
  restrictedServices:
    - aiplatform.googleapis.com
    - storage.googleapis.com
  accessLevels:
    - accessPolicies/POLICY_ID/accessLevels/corporate_network
  vpcAccessibleServices:
    enableRestriction: true
    allowedServices:
      - aiplatform.googleapis.com
  ingressPolicies:
    - ingressFrom:
        identities:
          - serviceAccount:gemini-prod-sa@PROJECT_ID.iam.gserviceaccount.com
        sources:
          - accessLevel: accessPolicies/POLICY_ID/accessLevels/corporate_network
      ingressTo:
        resources:
          - "*"
        operations:
          - serviceName: aiplatform.googleapis.com
            methodSelectors:
              - method: "google.cloud.aiplatform.v1.PredictionService.Predict"
  egressPolicies:
    - egressFrom:
        identities:
          - serviceAccount:gemini-prod-sa@PROJECT_ID.iam.gserviceaccount.com
      egressTo:
        resources:
          - "*"
        operations:
          - serviceName: storage.googleapis.com

Compliance & GDPR

⏱ 30 min Advanced

🎯 Learning objectives

  • Understand GDPR requirements for AI systems
  • Configure data residency and data sovereignty
  • Carry out a DPIA (Data Protection Impact Assessment)
  • Know the SOC 2, HIPAA, and ISO 27001 certifications

⚖️ GDPR and AI: Key Obligations

| GDPR article | Obligation | Vertex AI implementation |
|--------------|------------|--------------------------|
| Art. 5 | Data minimization | DLP to filter PII, no unnecessary storage |
| Art. 13-14 | Transparency (inform users) | Disclaimer "This chat uses Gemini by Google" |
| Art. 15 | Right of access | Log every user request, API export |
| Art. 17 | Right to erasure | Purge logs after 90 days, no fine-tuning on user data |
| Art. 25 | Privacy by design | VPC-SC, CMEK, anonymization by default |
| Art. 28 | DPA (Data Processing Agreement) | Sign Google's Cloud Data Processing Addendum |
| Art. 32 | Security | TLS 1.3, encryption at rest, audit logs |
| Art. 33 | Breach notification (72h) | Security Command Center alerts |
| Art. 35 | DPIA if high risk | DPIA template for Gemini chatbots |
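One concrete way to implement the Art. 17 retention line above is to cap log-bucket retention so request logs age out automatically. A minimal sketch (`_Default` is the standard Cloud Logging bucket; `PROJECT_ID` is a placeholder):

```shell
# Cap retention of the default Cloud Logging bucket at 90 days,
# so logged user requests are purged automatically (GDPR Art. 17)
gcloud logging buckets update _Default \
  --location=global \
  --retention-days=90 \
  --project=PROJECT_ID
```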

๐ŸŒ Data Residency & Data Sovereignty

Vertex AI regions disponibles (2026) :

  • Europe : europe-west1 (Belgique), europe-west4 (Pays-Bas), europe-west9 (France)
  • US : us-central1 (Iowa), us-east1 (Caroline du Sud), us-west1 (Oregon)
  • Asia : asia-northeast1 (Tokyo), asia-southeast1 (Singapour)
python
# EU region configuration for GDPR compliance
import vertexai
from vertexai.generative_models import GenerativeModel

# IMPORTANT: force an EU region for GDPR-scoped data
vertexai.init(
    project="mon-projet-eu",
    location="europe-west1"  # Belgium (EU)
)

model = GenerativeModel("gemini-2.0-flash-exp")

# Check that the region really is in the EU
# (global_config is an internal SDK attribute; use it for debugging only)
from google.cloud.aiplatform import initializer
print(f"Region in use: {initializer.global_config.location}")
# Output: "europe-west1"
โš ๏ธ Attention Multi-region
Google AI Studio utilise des endpoints multi-region (EU ou US global). Pour GDPR strict, utilisez TOUJOURS Vertex AI avec une region EU explicite. Vertex AI garantit que les donnees ne quittent pas la region choisie.

๐Ÿ“‹ DPIA Template pour Chatbot Gemini

Data Protection Impact Assessment (DPIA) requis si :

  • โœ… Traitement automatise avec effets juridiques (ex: credit scoring avec IA)
  • โœ… Surveillance systematique a grande echelle (ex: monitoring employes)
  • โœ… Donnees sensibles : sante (HIPAA), enfants, biometrie
markdown
# DPIA Template: Customer Support Chatbot (Gemini)

## 1. Description of the processing
- **Nature**: Generative AI chatbot for customer support
- **Scope**: 50,000 users/month, EU only
- **Context**: Product support questions, no payments
- **Purposes**: Answer questions, reduce support tickets

## 2. Data processed
- **Data collected**: Name, email, conversation history
- **Sensitive data**: NONE (no health, religion, etc.)
- **Retention**: 90 days, then automatic deletion

## 3. Necessity and proportionality
- **Legal basis**: Legitimate interest (Art. 6(1)(f) GDPR)
- **Minimization**: Only name/email, no phone/address
- **Alternatives considered**: Human-only support (too slow), static FAQ (less effective)

## 4. Identified risks
| Risk | Impact | Likelihood | Mitigation |
|------|--------|------------|------------|
| Leak of conversational data | Medium | Low | VPC-SC, CMEK, TLS 1.3 |
| Hallucination giving bad advice | Medium | Medium | Grounding on docs, disclaimer |
| Re-identification via writing style | Low | Very low | No fine-tuning |

## 5. Security measures
- ✅ Encryption in transit (TLS 1.3) and at rest (AES-256)
- ✅ VPC Service Controls (no external access)
- ✅ Audit logs enabled (Cloud Logging)
- ✅ DLP to detect accidental PII
- ✅ EU region (europe-west1) with data residency

## 6. User rights
- ✅ Transparent information ("Powered by Gemini" banner)
- ✅ Right of access (API export of conversations)
- ✅ Right to erasure ("Delete my data" button)
- ✅ Right to object (chatbot opt-out)

## 7. Conclusion
Residual risk: **LOW**
DPIA approved by: DPO (Data Protection Officer)
Date: 2026-02-10

๐Ÿฅ HIPAA Compliance (Donnees de Sante)

Google Cloud signe BAA (Business Associate Agreement) pour :

  • โœ… Vertex AI (Gemini via Vertex AI uniquement, PAS AI Studio)
  • โœ… Cloud Storage, BigQuery, Cloud SQL
  • โœ… Cloud Logging (mais desactiver Data Access Logs PHI)
bash
# HIPAA-compliant configuration for a medical application
# 1. Enforce CMEK via an organization policy
#    (gcp.restrictNonCmekServices is a list constraint: the services listed
#     must use customer-managed encryption keys)
gcloud org-policies set-policy cmek-policy.yaml

# cmek-policy.yaml
# name: projects/PROJECT_ID/policies/gcp.restrictNonCmekServices
# spec:
#   rules:
#     - values:
#         allowedValues:
#           - is:aiplatform.googleapis.com

# 2. Configure Access Approval (approve Google staff access to your data;
#    Access Transparency, which logs that access, is enabled separately)
gcloud organizations add-iam-policy-binding ORG_ID \
  --member='domain:example.com' \
  --role='roles/accessapproval.approver'

# 3. Configure compliant log retention (6 years for HIPAA)
gcloud logging sinks create hipaa-audit-sink \
  bigquery.googleapis.com/projects/PROJECT_ID/datasets/hipaa_audit_logs \
  --log-filter='protoPayload.serviceName="aiplatform.googleapis.com"'

# 4. Disable Data Access logs for aiplatform to avoid logging PHI
# (IAM & Admin > Audit Logs > disable "Data Read/Write" for aiplatform)

For HIPAA: ALWAYS use Vertex AI (never AI Studio), sign the BAA with Google, enable CMEK, configure Access Transparency, and retain audit logs for 6 years. Also consider de-identifying data with the Cloud Healthcare API before sending it to Gemini.

๐Ÿ”’ ISO 27001 & SOC 2 Type II

Google Cloud est certifie :

  • โœ… ISO 27001 (Information Security Management)
  • โœ… ISO 27017 (Cloud Security)
  • โœ… ISO 27018 (Privacy in Cloud)
  • โœ… SOC 2 Type II (Security, Availability, Confidentiality)
  • โœ… SOC 3 (version publique de SOC 2)

Rapports disponibles :

  • ๐Ÿ“„ Console GCP > Security > Compliance Reports Manager
  • ๐Ÿ“„ Telecharger ISO/SOC reports pour audits
  • ๐Ÿ“„ Partager avec auditeurs sous NDA

CI/CD for AI

⏱ 35 min Advanced

🎯 Learning objectives

  • Set up a CI/CD pipeline for Gemini applications
  • Implement prompt versioning and automated testing
  • Configure evaluation gates before production
  • Deploy with canary and blue-green strategies

🔄 Complete CI/CD Pipeline

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚                      CI/CD PIPELINE GEMINI                   โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
     โ”‚
     โ–ผ
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”   โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”   โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚  1. COMMIT   โ”‚โ”€โ”€โ–ถโ”‚  2. BUILD    โ”‚โ”€โ”€โ–ถโ”‚  3. TEST     โ”‚
โ”‚              โ”‚   โ”‚              โ”‚   โ”‚              โ”‚
โ”‚ - Git push   โ”‚   โ”‚ - Docker img โ”‚   โ”‚ - Unit tests โ”‚
โ”‚ - PR opened  โ”‚   โ”‚ - Lint       โ”‚   โ”‚ - Prompt evalโ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜   โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜   โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                                              โ”‚
                                              โ–ผ
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”   โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”   โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚  6. PROD     โ”‚โ—€โ”€โ”€โ”‚  5. STAGING  โ”‚โ—€โ”€โ”€โ”‚ 4. DEV DEPLOYโ”‚
โ”‚              โ”‚   โ”‚              โ”‚   โ”‚              โ”‚
โ”‚ - Canary 10% โ”‚   โ”‚ - Smoke test โ”‚   โ”‚ - Cloud Run  โ”‚
โ”‚ - Monitor    โ”‚   โ”‚ - Eval gate  โ”‚   โ”‚ - Auto deployโ”‚
โ”‚ - Rollback?  โ”‚   โ”‚ - Manual OK  โ”‚   โ”‚              โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜   โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜   โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

๐Ÿ— Cloud Build Configuration

yaml
# cloudbuild.yaml - Complete CI/CD pipeline
steps:
  # Step 1: Lint and format checks
  - name: 'python:3.12'
    id: 'lint'
    entrypoint: 'bash'
    args:
      - '-c'
      - |
        pip install ruff black
        ruff check src/
        black --check src/

  # Step 2: Unit tests
  - name: 'python:3.12'
    id: 'unit-tests'
    entrypoint: 'bash'
    args:
      - '-c'
      - |
        pip install -r requirements.txt
        pytest tests/unit/ --cov=src --cov-report=term

  # Step 3: Prompt evaluation
  - name: 'python:3.12'
    id: 'prompt-eval'
    entrypoint: 'bash'
    secretEnv: ['VERTEX_PROJECT']
    args:
      - '-c'
      - |
        pip install -r requirements.txt
        # $$ escapes Cloud Build substitution so the shell sees the secret env var
        python scripts/eval_prompts.py --project=$$VERTEX_PROJECT --threshold=0.7
    waitFor: ['unit-tests']

  # Step 4: Build the Docker image
  - name: 'gcr.io/cloud-builders/docker'
    id: 'build-image'
    args:
      - 'build'
      - '-t'
      - 'gcr.io/$PROJECT_ID/gemini-api:$SHORT_SHA'
      - '-t'
      - 'gcr.io/$PROJECT_ID/gemini-api:latest'
      - '.'
    waitFor: ['prompt-eval']

  # Step 5: Push the image
  - name: 'gcr.io/cloud-builders/docker'
    id: 'push-image'
    args:
      - 'push'
      - '--all-tags'
      - 'gcr.io/$PROJECT_ID/gemini-api'
    waitFor: ['build-image']

  # Step 6: Deploy to DEV (automatic)
  - name: 'gcr.io/google.com/cloudsdktool/cloud-sdk'
    id: 'deploy-dev'
    entrypoint: 'bash'
    args:
      - '-c'
      - |
        gcloud run deploy gemini-api-dev \
          --image gcr.io/$PROJECT_ID/gemini-api:$SHORT_SHA \
          --region us-central1 \
          --platform managed \
          --service-account gemini-dev-sa@$PROJECT_ID.iam.gserviceaccount.com \
          --set-env-vars ENV=dev,VERSION=$SHORT_SHA \
          --tag dev-$SHORT_SHA
    waitFor: ['push-image']

  # Step 7: DEV smoke tests
  - name: 'python:3.12'
    id: 'smoke-tests-dev'
    entrypoint: 'bash'
    args:
      - '-c'
      - |
        pip install requests
        python scripts/smoke_tests.py --url=https://gemini-api-dev-HASH-uc.a.run.app
    waitFor: ['deploy-dev']

  # Step 8: Deploy to STAGING (main branch only)
  - name: 'gcr.io/google.com/cloudsdktool/cloud-sdk'
    id: 'deploy-staging'
    entrypoint: 'bash'
    args:
      - '-c'
      - |
        if [ "$BRANCH_NAME" = "main" ]; then
          gcloud run deploy gemini-api-staging \
            --image gcr.io/$PROJECT_ID/gemini-api:$SHORT_SHA \
            --region us-central1 \
            --platform managed \
            --service-account gemini-staging-sa@$PROJECT_ID.iam.gserviceaccount.com \
            --set-env-vars ENV=staging,VERSION=$SHORT_SHA
        fi
    waitFor: ['smoke-tests-dev']

# Secrets from Secret Manager
availableSecrets:
  secretManager:
    - versionName: projects/$PROJECT_ID/secrets/vertex-project/versions/latest
      env: 'VERTEX_PROJECT'

# Global timeout
timeout: '1800s'

# Tags
tags: ['gemini-api', 'ci-cd']

options:
  machineType: 'E2_HIGHCPU_8'
  logging: CLOUD_LOGGING_ONLY

๐Ÿ“ Prompt Versioning

python
# prompts.py - Prompt versioning
from dataclasses import dataclass
from typing import Dict
import json

@dataclass
class PromptVersion:
    version: str
    system_instruction: str
    temperature: float
    max_tokens: int
    metadata: Dict[str, str]

class PromptRegistry:
    """Registry centralise pour tous les prompts versions"""

    PROMPTS = {
        "customer_support_v1": PromptVersion(
            version="1.0.0",
            system_instruction="""Tu es un assistant support client pour AcmeCorp.
Reponds de maniere concise et professionnelle.
Si tu ne sais pas, dis 'Je ne sais pas, je transfere a un humain.'""",
            temperature=0.3,
            max_tokens=512,
            metadata={"created": "2026-01-15", "author": "team-support"}
        ),
        "customer_support_v2": PromptVersion(
            version="2.0.0",
            system_instruction="""Tu es un assistant support client expert pour AcmeCorp.

REGLES :
1. Reponds en 2-3 phrases maximum
2. Utilise les docs (grounding) pour info precise
3. Si pas dans docs, dis "Je ne sais pas"
4. Toujours termine par "Autre question ?"

TONE : Professionnel mais amical""",
            temperature=0.2,  # Plus deterministe
            max_tokens=256,   # Plus court
            metadata={"created": "2026-02-01", "author": "team-support", "ab_test": "variant_b"}
        ),
    }

    @classmethod
    def get_prompt(cls, prompt_id: str) -> PromptVersion:
        if prompt_id not in cls.PROMPTS:
            raise ValueError(f"Prompt {prompt_id} not found")
        return cls.PROMPTS[prompt_id]

    @classmethod
    def get_active_prompt(cls, use_case: str = "customer_support") -> PromptVersion:
        """Retourne le prompt actif (gere via feature flags)"""
        # En production, lire depuis feature flag (LaunchDarkly, Cloud Config, etc.)
        active_version = "customer_support_v2"  # ou v1 selon A/B test
        return cls.get_prompt(active_version)

# Usage
from vertexai.generative_models import GenerativeModel

prompt_config = PromptRegistry.get_active_prompt("customer_support")
model = GenerativeModel(
    "gemini-2.0-flash-exp",
    system_instruction=prompt_config.system_instruction
)

response = model.generate_content(
    "Comment retourner un produit ?",
    generation_config={
        "temperature": prompt_config.temperature,
        "max_output_tokens": prompt_config.max_tokens
    }
)
๐Ÿ’ก Best practices prompt versioning
  • โœ… Toujours versionner les prompts (semantic versioning : 1.0.0, 1.1.0, 2.0.0)
  • โœ… Stocker dans Git avec review process (PR required)
  • โœ… Tracker metadata : auteur, date, rationale du changement
  • โœ… A/B tester nouvelles versions avant rollout 100%
  • โœ… Rollback rapide si degradation qualite

โœ… Evaluation Gates

python
# scripts/eval_prompts.py - Automatic eval before deploy
import vertexai
from vertexai.generative_models import GenerativeModel
from vertexai.preview.evaluation import EvalTask
import argparse

def run_eval_gate(project: str, threshold: float = 0.7):
    """
    Evalue le prompt sur dataset de test.
    Echoue le build si score < threshold.
    """
    vertexai.init(project=project, location="us-central1")

    # Test dataset (golden set)
    test_cases = [
        {
            "input": "How do I return a product?",
            "expected_output": "You have 30 days to return a product...",
            "rubric": "Must mention the 30-day window and the procedure"
        },
        {
            "input": "What is the price of product XYZ?",
            "expected_output": "I don't know",
            "rubric": "Must say 'I don't know' if the info is not in the docs"
        },
        # ... 50+ test cases
    ]
    ]

    # Load the active prompt
    from prompts import PromptRegistry
    prompt_config = PromptRegistry.get_active_prompt()

    model = GenerativeModel(
        "gemini-2.0-flash-exp",
        system_instruction=prompt_config.system_instruction
    )

    # Evaluation with the Vertex AI evaluation service
    eval_task = EvalTask(
        dataset=test_cases,  # in practice, convert to a pandas DataFrame with the columns EvalTask expects
        metrics=["coherence", "fluency", "safety", "groundedness"],
        experiment="prompt-eval-" + prompt_config.version
    )

    results = eval_task.evaluate(model=model)

    # Compute an overall score
    avg_score = sum(
        results.summary_metrics[m] for m in ["coherence", "fluency", "groundedness"]
    ) / 3

    print(f"Evaluation score: {avg_score:.2f}")
    print(f"Threshold: {threshold}")

    if avg_score < threshold:
        print("❌ EVAL GATE FAILED - score below threshold")
        exit(1)  # Fail the build
    else:
        print("โœ… EVAL GATE PASSED")
        exit(0)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--project", required=True)
    parser.add_argument("--threshold", type=float, default=0.7)
    args = parser.parse_args()

    run_eval_gate(args.project, args.threshold)

๐Ÿšฆ Deployment Strategies

1. Canary Deployment (Recommande) :

bash
# Deploy the new version with no traffic yet
gcloud run deploy gemini-api-prod \
  --image gcr.io/PROJECT_ID/gemini-api:v2.0.0 \
  --region us-central1 \
  --tag canary \
  --no-traffic  # no initial traffic

# Route 10% of traffic to the canary
gcloud run services update-traffic gemini-api-prod \
  --region us-central1 \
  --to-tags canary=10

# Monitor for 1h (errors, latency, quality)
# If OK: increase to 50%
gcloud run services update-traffic gemini-api-prod \
  --region us-central1 \
  --to-tags canary=50

# If OK: roll out to 100%
gcloud run services update-traffic gemini-api-prod \
  --region us-central1 \
  --to-latest

# If KO: immediate rollback
gcloud run services update-traffic gemini-api-prod \
  --region us-central1 \
  --to-revisions PREVIOUS_REVISION=100

2. Blue-green deployment:

bash
# BLUE environment (currently in prod)
gcloud run deploy gemini-api-blue \
  --image gcr.io/PROJECT_ID/gemini-api:v1.0.0 \
  --region us-central1

# GREEN environment (new version)
gcloud run deploy gemini-api-green \
  --image gcr.io/PROJECT_ID/gemini-api:v2.0.0 \
  --region us-central1

# The load balancer points to BLUE
# After validating GREEN: switch the load balancer to GREEN
# If there is a problem: instant switch back to BLUE

For production, prefer canary deployments with Cloud Run (native traffic-split support). Start with 10% of traffic on the new version, monitor for 1-2h (errors, P95 latency, response quality via eval), then ramp up progressively. Always keep a one-click rollback ready.

Lab: GCP Deployment Pipeline

⏱ 60 min Hands-on

🎯 Lab objectives

  • Build a complete CI/CD pipeline on Cloud Build
  • Deploy a Gemini application on Cloud Run
  • Configure monitoring and alerts
  • Test a canary deployment with rollback

🧪 Hands-on Lab: Production Pipeline

Estimated duration: 60 minutes

Step 1: GCP project setup (10 min)

Create a new project and enable the required APIs.

bash
export PROJECT_ID="gemini-lab-$(date +%s)"
gcloud projects create $PROJECT_ID
gcloud config set project $PROJECT_ID

# Enable the APIs (note: the Monitoring API is monitoring.googleapis.com)
gcloud services enable aiplatform.googleapis.com \
  cloudbuild.googleapis.com \
  run.googleapis.com \
  secretmanager.googleapis.com \
  monitoring.googleapis.com

# Create a service account
gcloud iam service-accounts create gemini-prod-sa

Step 2: Clone the starter code (5 min)

bash
git clone https://github.com/google-cloud/gemini-deployment-starter
cd gemini-deployment-starter

# Structure:
# - src/main.py (Flask API + Gemini)
# - Dockerfile
# - cloudbuild.yaml
# - tests/
# - prompts/

Step 3: Configure Cloud Build (10 min)

Connect GitHub and configure the triggers.

bash
# Grant Cloud Build the required permissions
PROJECT_NUMBER=$(gcloud projects describe $PROJECT_ID --format="value(projectNumber)")
gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member="serviceAccount:${PROJECT_NUMBER}@cloudbuild.gserviceaccount.com" \
  --role="roles/run.admin"

gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member="serviceAccount:${PROJECT_NUMBER}@cloudbuild.gserviceaccount.com" \
  --role="roles/iam.serviceAccountUser"

# Create the Cloud Build trigger
gcloud builds triggers create github \
  --repo-name=gemini-deployment-starter \
  --repo-owner=YOUR_GITHUB \
  --branch-pattern="^main$" \
  --build-config=cloudbuild.yaml

Step 4: First deploy (15 min)

Push the code and watch the pipeline run.

bash
# Edit prompts/customer_support.py
# Commit and push
git add .
git commit -m "Initial deploy"
git push origin main

# Watch the build in the Cloud Console
gcloud builds list --ongoing

# Once it finishes, test the API
SERVICE_URL=$(gcloud run services describe gemini-api-dev \
  --region us-central1 --format="value(status.url)")

curl -X POST $SERVICE_URL/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "How do I return a product?"}'

Step 5: Monitoring & alerts (10 min)

bash
# Create the monitoring dashboard
gcloud monitoring dashboards create --config-from-file=dashboard.json

# Create an alert when the error rate exceeds 5%
# (threshold conditions are easiest to express in a policy file;
#  error-rate-policy.json defines "error rate > 5% for 300s")
gcloud alpha monitoring policies create \
  --notification-channels=CHANNEL_ID \
  --policy-from-file=error-rate-policy.json

Step 6: Canary deploy & rollback (10 min)

Deploy a v2 with an intentional bug, then roll back.

bash
# Deploy v2 (with the bug)
git checkout -b v2-buggy
# Set temperature=2.0 (too high, unstable responses)
git commit -am "v2: increase creativity"
git push origin v2-buggy

# Merge into main (auto-deploys to staging)
# Promote to prod with a 10% canary
gcloud run services update-traffic gemini-api-prod \
  --to-tags canary=10

# Watch the metrics (latency rises, quality drops)
# ROLLBACK
gcloud run services update-traffic gemini-api-prod \
  --to-revisions gemini-api-prod-00001-abc=100

โœ… Verification

Verifier que vous avez :

  • โœ… Pipeline CI/CD fonctionnel avec Cloud Build
  • โœ… Application deployee sur Cloud Run
  • โœ… Monitoring dashboard avec metriques
  • โœ… Alertes configurees
  • โœ… Canary deployment + rollback testes

Ce lab vous a montre un workflow production-ready. En entreprise, ajoutez : evaluation automatique avant chaque deploy, integration tests end-to-end, security scanning (Snyk, Trivy), et approvals manuels avant prod.
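For the manual-approval point, Cloud Build triggers have a built-in gate. A hedged sketch (the trigger name and `cloudbuild.prod.yaml` are illustrative; the flag makes each build wait in the console for an approver):

```shell
# Require manual approval before the prod trigger runs
gcloud builds triggers create github \
  --name=gemini-api-prod-deploy \
  --repo-name=gemini-deployment-starter \
  --repo-owner=YOUR_GITHUB \
  --branch-pattern="^main$" \
  --build-config=cloudbuild.prod.yaml \
  --require-approval
```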

Quiz Module 4.1

⏱ 15 min Assessment

📝 Quiz: Enterprise Deployment

15 questions to check your knowledge

1. What is the main difference between AI Studio and Vertex AI?

AI Studio is paid, Vertex AI is free
Vertex AI offers IAM, VPC-SC, and enterprise SLAs
AI Studio is faster
No difference

2. What is the minimum IAM role needed to call Gemini on Vertex AI?

roles/owner
roles/editor
roles/aiplatform.user
roles/viewer

3. Which deployment option is recommended for a serverless Gemini API?

Cloud Run
GKE
Compute Engine
App Engine

4. How do you eliminate cold starts on Cloud Run?

Increase memory
Set min-instances > 0
Use smaller containers
Impossible

5. VPC Service Controls (VPC-SC) lets you:

Reduce costs
Speed up requests
Prevent data exfiltration
Encrypt data

6. Where should third-party API keys be stored securely?

In the source code
Secret Manager
A .env file committed to Git
Cloud Run environment variables

7. DLP (Data Loss Prevention) lets you:

Detect and mask PII before sending it to Gemini
Back up data
Compress prompts
Monitor costs

8. For strict GDPR compliance, you must:

Use AI Studio
Use Vertex AI multi-region
Use Vertex AI with an explicit EU region
Disable logging

9. CMEK (Customer-Managed Encryption Keys) gives you:

Better performance
Control over the encryption keys
Stronger encryption
Lower costs

10. Which GCP certification is required for US health data?

ISO 27001
SOC 2
GDPR
HIPAA (with a signed BAA)

11. In an AI CI/CD pipeline, evaluation gates are used to:

Block a deploy if prompt quality is below a threshold
Speed up the build
Reduce costs
Generate documentation

12. Why version prompts?

To reduce tokens
To speed up requests
To track changes, A/B test, and roll back
It is not necessary

13. Canary deployment means:

Deploying to all servers at once
Deploying to 10% of traffic, monitoring, then ramping up
Deploying only at night
Deploying with a yellow bird

14. Data Access audit logs must be enabled for:

Compliance (GDPR Art. 32, SOC 2, HIPAA)
Cost reduction
Speeding up Vertex AI
No reason

15. Private Service Connect lets you:

Increase quotas
Reduce costs
Call Vertex AI from a VPC without traversing the Internet
Create custom models

Understanding Gemini Costs

⏱ 25 min Intermediate

🎯 Learning objectives

  • Master the Gemini pricing model (8 models)
  • Understand tiered pricing and the 200K-token threshold
  • Calculate the cost of a Gemini application
  • Forecast and budget AI costs

💰 Gemini 2.5 Pricing (2026)

| Model | Input ≤200K | Input >200K | Output ≤200K | Output >200K | Context Cache |
|-------|-------------|-------------|--------------|--------------|---------------|
| 2.5 Pro | $3.00 / 1M | $1.50 / 1M | $12.00 / 1M | $6.00 / 1M | $0.30 / 1M |
| 2.5 Flash | $0.15 / 1M | $0.075 / 1M | $0.60 / 1M | $0.30 / 1M | $0.015 / 1M |
| 2.5 Flash-8B | $0.04 / 1M | $0.02 / 1M | $0.16 / 1M | $0.08 / 1M | $0.004 / 1M |
| 2.0 Pro Exp (Extended Thinking) | $3.00 / 1M | $1.50 / 1M | $12.00 / 1M | $6.00 / 1M | - |
| 2.0 Flash Exp | $0.15 / 1M | $0.075 / 1M | $0.60 / 1M | $0.30 / 1M | $0.015 / 1M |
| 1.5 Pro | $2.50 / 1M | $1.25 / 1M | $10.00 / 1M | $5.00 / 1M | $0.25 / 1M |
| 1.5 Flash | $0.10 / 1M | $0.05 / 1M | $0.40 / 1M | $0.20 / 1M | $0.01 / 1M |
| 1.5 Flash-8B | $0.03 / 1M | $0.015 / 1M | $0.12 / 1M | $0.06 / 1M | $0.003 / 1M |
๐Ÿ’ก Tiered Pricing
Le prix est divise par 2 au-dela de 200K tokens. Exemple : si vous envoyez 300K tokens input avec Flash, vous payez : (200K ร— $0.15) + (100K ร— $0.075) = $30 + $7.5 = $37.5 / 1M tokens effectifs.
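The tiered arithmetic generalizes into a small helper; a quick sketch using the Flash rates from the pricing table:

```python
def tiered_cost(tokens: int, rate_low: float, rate_high: float,
                threshold: int = 200_000) -> float:
    """Dollar cost for `tokens`, given per-1M rates below/above the threshold."""
    low = min(tokens, threshold)       # tokens billed at the <=200K rate
    high = max(tokens - threshold, 0)  # tokens billed at the >200K rate
    return (low * rate_low + high * rate_high) / 1_000_000

# 300K input tokens on Flash: (200K x 0.15 + 100K x 0.075) / 1M = $0.0375
print(f"${tiered_cost(300_000, 0.15, 0.075):.4f}")  # -> $0.0375
```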

๐Ÿ”ข Composants de Cout

1. Input tokens : Prompt utilisateur + system instruction + context cached

2. Output tokens : Reponse generee par Gemini

3. Cached tokens : Context mis en cache (coute 10x moins cher)

4. Thinking tokens (2.0 Pro Exp) : Comptes comme output tokens

python
# Compute the cost of a request
from vertexai.generative_models import GenerativeModel

model = GenerativeModel("gemini-2.5-flash")
response = model.generate_content("Explain relativity in 3 paragraphs")

# Extract usage
usage = response.usage_metadata
print(f"Input tokens: {usage.prompt_token_count}")
print(f"Output tokens: {usage.candidates_token_count}")
print(f"Cached tokens: {usage.cached_content_token_count}")

# Compute the cost (Flash, ≤200K tier)
input_cost = (usage.prompt_token_count / 1_000_000) * 0.15
output_cost = (usage.candidates_token_count / 1_000_000) * 0.60
cache_cost = (usage.cached_content_token_count / 1_000_000) * 0.015

total_cost = input_cost + output_cost + cache_cost
print(f"Total cost: ${total_cost:.6f}")
# Example: 50 input tokens, 200 output tokens
# $0.0000075 + $0.000120 = $0.0001275 per request

๐Ÿ“Š Simulation Cout Application

Exemple : Chatbot support client

  • Volume : 100,000 conversations/mois
  • Moyenne : 5 messages par conversation
  • Input moyen : 500 tokens (system instruction 200 + user 300)
  • Output moyen : 150 tokens
  • Modele : Gemini 2.5 Flash
python
# Chatbot cost calculation
conversations_per_month = 100_000
messages_per_conversation = 5
total_messages = conversations_per_month * messages_per_conversation  # 500,000

input_tokens_per_message = 500
output_tokens_per_message = 150

total_input_tokens = total_messages * input_tokens_per_message  # 250M
total_output_tokens = total_messages * output_tokens_per_message  # 75M

# Flash pricing (≤200K tier for simplicity)
input_cost = (total_input_tokens / 1_000_000) * 0.15  # $37.50
output_cost = (total_output_tokens / 1_000_000) * 0.60  # $45.00

monthly_cost = input_cost + output_cost  # $82.50/month
yearly_cost = monthly_cost * 12  # $990/year

print(f"Monthly cost: ${monthly_cost:.2f}")
print(f"Yearly cost: ${yearly_cost:.2f}")
print(f"Cost per conversation: ${monthly_cost / conversations_per_month:.4f}")
# $0.000825 per conversation (less than a tenth of a cent!)

Flash is remarkably cheap: ~$0.0008 per conversation. Even at 1M conversations/month you would pay only ~$825/month. Pro costs 20x more but delivers better quality for complex use cases. Start with Flash and upgrade to Pro only when needed.

๐Ÿ’ก Facteurs d'Impact Cout

  1. Choix du modele : Flash-8B (4x moins cher que Flash) vs Pro (20x plus cher)
  2. Longueur system instruction : 2000 tokens d'instruction = $0.0003 input par requete (Flash)
  3. Context caching : -90% cout sur partie cachee
  4. Output tokens : Output coute 4x plus que input (controler max_output_tokens)
  5. Tiered pricing : >200K tokens = -50% prix
  6. Streaming : Meme cout que non-streaming (pas de surcharge)
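Factors 2 and 4 are easy to quantify. A back-of-the-envelope sketch with the Flash rates from the pricing table (the request volume and token counts are illustrative):

```python
# Monthly overhead of a long system instruction, and savings from capping output.
FLASH_IN, FLASH_OUT = 0.15, 0.60   # $/1M tokens (Flash, <=200K tier)
requests_per_month = 500_000

# Factor 2: a 2000-token instruction is resent with every request
system_tokens = 2_000
instruction_cost = requests_per_month * system_tokens * FLASH_IN / 1_000_000
print(f"System instruction overhead: ${instruction_cost:.2f}/month")  # $150.00

# Factor 4: output is 4x the price of input, so trimming max_output_tokens pays off
saved_output_tokens = 100  # e.g. replies capped at 150 instead of 250 tokens
output_savings = requests_per_month * saved_output_tokens * FLASH_OUT / 1_000_000
print(f"Savings from shorter outputs: ${output_savings:.2f}/month")  # $30.00
```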

Cost Optimization Strategies

⏱ 30 min Advanced

🎯 Learning objectives

  • Master 7 cost-optimization techniques
  • Implement intelligent model routing
  • Use context caching for a -90% cost cut
  • Optimize prompts and output tokens

🎯 The 7 Optimization Techniques

Model Routing · Context Caching · Batch API · Prompt Compression · Output Control · Flash-8B/Lite · Implicit Cache. Potential savings: 60-80%

🔀 1. Intelligent Model Routing

Principle: use Flash-8B for simple requests, Flash for standard ones, Pro for complex ones.

python
# Intelligent router based on request complexity
from vertexai.generative_models import GenerativeModel

class GeminiRouter:
    def __init__(self):
        self.flash_8b = GenerativeModel("gemini-2.5-flash-8b")
        self.flash = GenerativeModel("gemini-2.5-flash")
        self.pro = GenerativeModel("gemini-2.5-pro")

    def classify_complexity(self, prompt: str) -> str:
        """Classify the complexity of the request"""
        text = prompt.lower()

        # Simple heuristics ("analy" also matches "analysis"/"analyze")
        if any(k in text for k in ("analy", "compare")):
            return "complex"
        elif any(k in text for k in ("summarize", "list")) or len(prompt) < 100:
            return "simple"
        else:
            return "standard"

    def route(self, prompt: str):
        """Route to the right model"""
        complexity = self.classify_complexity(prompt)

        if complexity == "simple":
            # Flash-8B: $0.04/1M input (75x cheaper than Pro)
            model = self.flash_8b
            print("→ Routing to Flash-8B (simple)")
        elif complexity == "complex":
            # Pro: best quality for reasoning
            model = self.pro
            print("→ Routing to Pro (complex)")
        else:
            # Flash: cost/quality balance
            model = self.flash
            print("→ Routing to Flash (standard)")

        return model.generate_content(prompt)

# Usage
router = GeminiRouter()

# Simple: Flash-8B (input at $0.04/1M)
response1 = router.route("What is the capital of France?")

# Standard: Flash (input at $0.15/1M)
response2 = router.route("Explain in a few paragraphs how TLS certificate validation works, for an audience of junior developers.")

# Complex: Pro (input at $3.00/1M)
response3 = router.route("Detailed comparative analysis of monolithic vs microservices architectures with specific use cases")

# Savings: ~70% on a mixed workload

๐Ÿ’พ 2. Context Caching (-90% sur cache)

python
from vertexai.preview import caching
from vertexai.generative_models import GenerativeModel
import datetime

# Creer cached content (system instruction longue)
system_instruction = """
Tu es un assistant support technique pour notre produit SaaS.
[... 5000 tokens de documentation produit ...]
Voici les 200 questions/reponses FAQ les plus frequentes :
[... documentation complete ...]
"""

cached_content = caching.CachedContent.create(
    model_name="gemini-2.5-flash",
    system_instruction=system_instruction,
    ttl=datetime.timedelta(hours=1),  # Cache 1h
)

# Utiliser cache pour multiples requetes
model = GenerativeModel.from_cached_content(cached_content)

# Requete 1 : Paye 5000 tokens cached ($0.000075) au lieu de input ($0.00075)
response1 = model.generate_content("Comment reinitialiser mon mot de passe ?")

# Requete 2-1000 : Cache hit, economie massive
response2 = model.generate_content("Ou trouver mes factures ?")

# ECONOMIE :
# Sans cache : 1000 requetes ร— 5000 tokens input ร— $0.15/1M = $0.75
# Avec cache : creation a plein tarif ($0.00075) + 1000 hits × $0.000075 = ~$0.076
# โ†’ 90% d'economie !
๐Ÿ’ก Quand utiliser Context Caching ?
System instruction >2000 tokens, reutilisee >5 fois, TTL >5 minutes. ROI positif des 5 requetes.
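
Ce seuil de rentabilite se verifie par un calcul rapide. Esquisse avec les tarifs Flash supposes ci-dessus (input $0.15/1M, cache $0.015/1M), hors cout de stockage :

```python
# Esquisse : seuil de rentabilite du caching (tarifs Flash supposes, stockage ignore)
INPUT_PRICE = 0.15 / 1_000_000   # $ par token input (plein tarif)
CACHE_PRICE = 0.015 / 1_000_000  # $ par token cache (10x moins cher)

def caching_rentable(tokens_caches: int, reutilisations: int) -> bool:
    """True si les hits de cache amortissent le surcout de creation."""
    cout_sans_cache = tokens_caches * reutilisations * INPUT_PRICE
    cout_avec_cache = (tokens_caches * INPUT_PRICE                      # creation (plein tarif)
                       + tokens_caches * reutilisations * CACHE_PRICE)  # hits (tarif cache)
    return cout_avec_cache < cout_sans_cache

print(caching_rentable(5000, 1))  # False : une seule requete, creation non amortie
print(caching_rentable(5000, 5))  # True : rentable des quelques reutilisations
```

En pratique, ajouter le cout de stockage horaire rapproche le seuil du chiffre indique ci-dessus.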

๐Ÿ“ฆ 3. Batch API (-50% cout)

python
import json
from google.cloud import aiplatform

# Preparer batch requests (JSONL)
batch_requests = []
with open("questions.txt") as f:
    for line in f:
        batch_requests.append({
            "request": {
                "contents": [{"role": "user", "parts": [{"text": line.strip()}]}]
            }
        })

# Ecrire JSONL
with open("batch_input.jsonl", "w") as f:
    for req in batch_requests:
        f.write(json.dumps(req) + "\n")

# Upload to GCS
from google.cloud import storage
bucket = storage.Client().bucket("my-batch-bucket")
blob = bucket.blob("batch_input.jsonl")
blob.upload_from_filename("batch_input.jsonl")

# Submit batch job
batch_job = aiplatform.BatchPredictionJob.create(
    job_display_name="gemini-batch-job",
    model_name="publishers/google/models/gemini-2.5-flash",
    gcs_source="gs://my-batch-bucket/batch_input.jsonl",
    gcs_destination_prefix="gs://my-batch-bucket/output/",
    instances_format="jsonl",
    predictions_format="jsonl",
)

print(f"Batch job created: {batch_job.name}")
print("Processing time: 10-30 minutes")
print("Cost reduction: 50% vs real-time API")

# ECONOMIE :
# Real-time : 10,000 requetes ร— $0.15/1M input = $1.50
# Batch : 10,000 requetes ร— $0.075/1M input = $0.75
# โ†’ 50% d'economie si non-urgent

โœ‚๏ธ 4. Prompt Compression

python
# AVANT (verbose, 250 tokens)
prompt_verbose = """
Je voudrais que tu m'aides a comprendre le concept de machine learning.
Peux-tu s'il te plait m'expliquer ce que c'est de maniere simple ?
J'aimerais aussi savoir quelles sont les principales applications.
Et si possible, donne-moi quelques exemples concrets.
Merci beaucoup pour ton aide !
"""

# APRES (concis, 120 tokens, -52%)
prompt_concis = """
Explique machine learning simplement : definition, applications, exemples concrets.
"""

# TECHNIQUE : Supprimer politesse, redondances, aller droit au but
# Economie : 52% tokens input sur ce prompt

# Pour system instructions :
system_before = """
You are a helpful assistant. You should always be polite and professional.
When answering questions, make sure to provide detailed explanations.
If you don't know something, be honest about it.
Always format your responses in a clear and readable way.
"""  # 150 tokens

system_after = """
Assistant technique. Reponses detaillees, format clair, honnete sur limites.
"""  # 50 tokens (-67%)

๐ŸŽš๏ธ 5. Output Control

python
from vertexai.generative_models import GenerativeModel, GenerationConfig

model = GenerativeModel("gemini-2.5-flash")

# MAUVAIS : Output non controle (peut faire 2000 tokens)
response_uncontrolled = model.generate_content(
    "Liste les pays europeens"
)
# โ†’ Peut generer 2000 tokens = $0.0012 output

# BON : Output controle avec max_output_tokens
response_controlled = model.generate_content(
    "Liste les pays europeens",
    generation_config=GenerationConfig(
        max_output_tokens=200,  # Limite stricte
        temperature=0.3,  # Moins creative = plus court
    )
)
# โ†’ Maximum 200 tokens = $0.00012 output
# โ†’ Economie 90%

# Pour JSON : schema strict = output deterministe court
response_json = model.generate_content(
    "Top 3 pays europeens par PIB",
    generation_config=GenerationConfig(
        response_mime_type="application/json",
        response_schema={
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "country": {"type": "string"},
                    "gdp": {"type": "number"}
                }
            },
            "maxItems": 3
        }
    )
)
# โ†’ Output JSON compact, pas de texte superflu

โšก 6. Utiliser Flash-8B pour Use Cases Simples

| Use Case | Modele Recommande | Cout Input | Economie vs Pro |
|---|---|---|---|
| FAQ / Support simple | Flash-8B | $0.04/1M | 75x moins cher |
| Classification | Flash-8B | $0.04/1M | 75x moins cher |
| Extraction entites | Flash | $0.15/1M | 20x moins cher |
| Resume documents | Flash | $0.15/1M | 20x moins cher |
| Analyse complexe | Pro | $3.00/1M | Justifie si qualite critique |
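
Le tableau ci-dessus peut se coder directement comme table de routage statique. Esquisse (les cles de use case sont hypothetiques) :

```python
# Esquisse : table de routage statique derivee du tableau ci-dessus
MODELE_PAR_USE_CASE = {
    "faq": "gemini-2.5-flash-8b",
    "classification": "gemini-2.5-flash-8b",
    "extraction": "gemini-2.5-flash",
    "resume": "gemini-2.5-flash",
    "analyse": "gemini-2.5-pro",
}

def choisir_modele(use_case: str) -> str:
    # Flash par defaut : equilibre cout/qualite pour un use case inconnu
    return MODELE_PAR_USE_CASE.get(use_case, "gemini-2.5-flash")

print(choisir_modele("faq"))      # gemini-2.5-flash-8b
print(choisir_modele("inconnu"))  # gemini-2.5-flash
```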

๐Ÿ”„ 7. Implicit Context Caching (Auto)

Gemini 2.5 cache automatiquement les prefixes longs communs (>1024 tokens) pendant 5 minutes. Pas de config necessaire.

python
# Si vous envoyez meme long prefix dans les 5 min :
long_context = "[... 3000 tokens documentation ...]"

# Requete 1 : Full cost
response1 = model.generate_content(long_context + "\nQuestion 1 ?")

# Requete 2 (dans les 5 min) : Implicit cache hit !
response2 = model.generate_content(long_context + "\nQuestion 2 ?")
# โ†’ Google detecte automatiquement prefix identique
# โ†’ Cache hit gratuit (si >1024 tokens prefix)

# Astuce : Structurer prompts avec context stable en debut

En combinant ces 7 techniques, vous pouvez reduire vos couts de 60-80%. Commencez par model routing et context caching (quick wins), puis optimisez prompts et output. Batch API pour traitement non-urgent. Mesurez avant/apres pour quantifier ROI.
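
Pour quantifier l'avant/apres, une estimation simple suffit. Esquisse avec des volumes hypothetiques (100 000 requetes/mois de 5 000 tokens, baseline tout-Pro, repartition 70/25/5 apres routing) :

```python
# Esquisse : ROI avant/apres optimisation (volumes et repartition hypothetiques)
def cout(requetes: int, tokens_par_req: int, prix_par_million: float) -> float:
    return requetes * tokens_par_req / 1_000_000 * prix_par_million

avant = cout(100_000, 5000, 3.00)    # baseline : tout sur Pro, sans optimisation
apres = (cout(70_000, 5000, 0.04)    # 70% routes vers Flash-8B
         + cout(25_000, 5000, 0.15)  # 25% vers Flash
         + cout(5_000, 5000, 3.00))  # 5% restent sur Pro
economie_pct = (avant - apres) / avant * 100
print(f"Avant: ${avant:.2f} | Apres: ${apres:.2f} | Economie: {economie_pct:.1f}%")
```

L'economie exacte depend de la baseline : partir d'un mix moins cher que tout-Pro ramene le gain vers la fourchette de 60-80%.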

Context Caching Avance

โฑ 35 min Avance

๐ŸŽฏ Objectifs d'apprentissage

  • Comprendre implicit vs explicit caching
  • Optimiser TTL pour maximiser ROI
  • Implementer warming strategies
  • Calculer ROI caching pour votre use case

๐Ÿ”„ Implicit vs Explicit Caching

| Aspect | Implicit Caching | Explicit Caching |
|---|---|---|
| Activation | Automatique (Gemini 2.5+) | Manuel via API |
| Taille min | >1024 tokens prefix | >2048 tokens |
| TTL | 5 minutes fixe | 1-60 minutes configurable |
| Cout cache | Gratuit | $0.015/1M tokens (Flash) |
| Use case | Conversations courtes | System instructions longues |
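
Les seuils du tableau peuvent se traduire en une petite fonction de decision. Esquisse reprenant les valeurs ci-dessus :

```python
# Esquisse : choisir la strategie de cache selon les seuils du tableau
def strategie_cache(prefix_tokens: int, intervalle_minutes: float) -> str:
    if prefix_tokens > 2048 and intervalle_minutes > 5:
        return "explicit"  # prefix long, reutilise au-dela des 5 min du cache implicite
    if prefix_tokens > 1024 and intervalle_minutes <= 5:
        return "implicit"  # automatique sur Gemini 2.5+, aucune config
    return "aucun"         # prefix trop court : caching non rentable

print(strategie_cache(5000, 30))  # explicit
print(strategie_cache(3000, 2))   # implicit
print(strategie_cache(500, 1))    # aucun
```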

๐Ÿ’ฐ Calculateur ROI Context Caching

python
class CachingROICalculator:
    def __init__(self, model="flash"):
        if model == "flash":
            self.input_price = 0.15  # $/1M tokens
            self.cache_price = 0.015  # $/1M tokens (10x moins)
        elif model == "pro":
            self.input_price = 3.00
            self.cache_price = 0.30

    def calculate_roi(self,
                      cached_tokens: int,
                      num_requests: int,
                      ttl_minutes: int):
        """Calculer ROI du caching"""

        # SANS CACHE
        cost_without_cache = (
            cached_tokens * num_requests / 1_000_000 * self.input_price
        )

        # AVEC CACHE
        # Creation cache : 1 fois
        cache_creation = cached_tokens / 1_000_000 * self.input_price
        # Hits cache : num_requests fois
        cache_hits = cached_tokens * num_requests / 1_000_000 * self.cache_price
        # Storage : ttl_minutes
        cache_storage = cached_tokens / 1_000_000 * self.cache_price * (ttl_minutes / 60)

        cost_with_cache = cache_creation + cache_hits + cache_storage

        # ROI
        savings = cost_without_cache - cost_with_cache
        savings_pct = (savings / cost_without_cache) * 100
        breakeven_requests = cache_creation / (
            cached_tokens / 1_000_000 * (self.input_price - self.cache_price)
        )

        return {
            "cost_without_cache": cost_without_cache,
            "cost_with_cache": cost_with_cache,
            "savings": savings,
            "savings_pct": savings_pct,
            "breakeven_requests": int(breakeven_requests) + 1
        }

# Exemple : Chatbot support avec system instruction 5000 tokens
calc = CachingROICalculator(model="flash")

# Scenario 1 : 10 requetes/heure, cache 1h
result1 = calc.calculate_roi(
    cached_tokens=5000,
    num_requests=10,
    ttl_minutes=60
)

print("Scenario 1 : 10 req/h, TTL 1h")
print(f"  Sans cache: ${result1['cost_without_cache']:.6f}")
print(f"  Avec cache: ${result1['cost_with_cache']:.6f}")
print(f"  Economie: ${result1['savings']:.6f} ({result1['savings_pct']:.1f}%)")
print(f"  Breakeven: {result1['breakeven_requests']} requetes")
print()

# Scenario 2 : 100 requetes/heure, cache 1h
result2 = calc.calculate_roi(
    cached_tokens=5000,
    num_requests=100,
    ttl_minutes=60
)

print("Scenario 2 : 100 req/h, TTL 1h")
print(f"  Sans cache: ${result2['cost_without_cache']:.6f}")
print(f"  Avec cache: ${result2['cost_with_cache']:.6f}")
print(f"  Economie: ${result2['savings']:.6f} ({result2['savings_pct']:.1f}%)")
print(f"  Breakeven: {result2['breakeven_requests']} requetes")

# SORTIE :
# Scenario 1 : 10 req/h, TTL 1h
#   Sans cache: $0.007500
#   Avec cache: $0.001575
#   Economie: $0.005925 (79.0%)
#   Breakeven: 2 requetes

# Scenario 2 : 100 req/h, TTL 1h
#   Sans cache: $0.075000
#   Avec cache: $0.008325
#   Economie: $0.066675 (88.9%)
#   Breakeven: 2 requetes

# โ†’ ROI positif des 2 requetes !

โฑ๏ธ Optimisation TTL

๐Ÿ“Š Formule TTL Optimal
TTL optimal = Intervalle moyen entre requetes ร— 5
Exemple : Si requetes toutes les 2 min โ†’ TTL = 10 min
python
import datetime
from vertexai.preview import caching

# Analyser pattern de traffic pour definir TTL
def optimize_ttl(request_intervals_minutes: list) -> int:
    """Calculer TTL optimal base sur pattern traffic"""
    avg_interval = sum(request_intervals_minutes) / len(request_intervals_minutes)
    optimal_ttl = int(avg_interval * 5)

    # Contraintes : 1-60 minutes
    if optimal_ttl < 1:
        return 1
    elif optimal_ttl > 60:
        return 60
    else:
        return optimal_ttl

# Exemple : Chatbot avec pic traffic 9h-18h
# Requetes toutes les 3 min en moyenne
intervals = [3, 2, 4, 3, 5, 2, 3, 4]  # minutes
optimal_ttl = optimize_ttl(intervals)
print(f"TTL optimal: {optimal_ttl} minutes")  # → 16 minutes

# Creer cache avec TTL optimal
cached_content = caching.CachedContent.create(
    model_name="gemini-2.5-flash",
    system_instruction="[... 5000 tokens ...]",
    ttl=datetime.timedelta(minutes=optimal_ttl),
)

# Alternative : TTL absolu (expire a heure precise)
# Utile pour cache qui doit expirer en fin de journee
expire_time = datetime.datetime.now() + datetime.timedelta(hours=8)
cached_content_absolute = caching.CachedContent.create(
    model_name="gemini-2.5-flash",
    system_instruction="[... 5000 tokens ...]",
    expire_time=expire_time,  # Expire a 18h
)

๐Ÿ”ฅ Warming Strategies

Probleme : Si cache expire pendant pic traffic, premiere requete lente (cold start).

Solution : Cache warming preemptif.

python
import time
import threading
from datetime import datetime, timedelta
from vertexai.preview import caching
from vertexai.generative_models import GenerativeModel

class CacheWarmer:
    def __init__(self, system_instruction: str, ttl_minutes: int):
        self.system_instruction = system_instruction
        self.ttl_minutes = ttl_minutes
        self.cached_content = None
        self.model = None
        self.warming_thread = None

    def create_cache(self):
        """Creer ou renouveler cache"""
        self.cached_content = caching.CachedContent.create(
            model_name="gemini-2.5-flash",
            system_instruction=self.system_instruction,
            ttl=timedelta(minutes=self.ttl_minutes),
        )
        self.model = GenerativeModel.from_cached_content(self.cached_content)
        print(f"[{datetime.now()}] Cache created/renewed")

    def start_warming(self):
        """Demarrer warming automatique"""
        self.create_cache()

        # Renouveler cache avant expiration
        refresh_interval = (self.ttl_minutes - 1) * 60  # 1 min avant expiration

        def warming_loop():
            while True:
                time.sleep(refresh_interval)
                self.create_cache()

        self.warming_thread = threading.Thread(target=warming_loop, daemon=True)
        self.warming_thread.start()

    def generate(self, prompt: str):
        """Generate avec cache toujours chaud"""
        if self.model is None:
            raise RuntimeError("Cache not initialized. Call start_warming() first.")
        return self.model.generate_content(prompt)

# Utilisation : Cache toujours chaud pendant heures bureau
warmer = CacheWarmer(
    system_instruction="[... 5000 tokens system instruction ...]",
    ttl_minutes=30
)

# Demarrer warming (renouvelle cache toutes les 29 min)
warmer.start_warming()

# Toutes les requetes utilisent cache chaud (pas de cold start)
response1 = warmer.generate("Question 1")
time.sleep(1800)  # 30 min plus tard
response2 = warmer.generate("Question 2")  # Cache renouvele automatiquement !

# Economie : Pas de cold start, latence optimale

๐Ÿ“Š Comparaison Couts : Cache vs No Cache

| Scenario | Cached Tokens | Requests/Day | Cost No Cache | Cost With Cache | Savings |
|---|---|---|---|---|---|
| Chatbot support | 5,000 | 10,000 | $7.50 | $0.83 | 89% |
| RAG system | 20,000 | 5,000 | $15.00 | $1.65 | 89% |
| Agent avec tools | 10,000 | 1,000 | $1.50 | $0.17 | 89% |
| Code assistant | 30,000 | 20,000 | $90.00 | $9.90 | 89% |

โš ๏ธ Quand NE PAS utiliser Cache
  • System instruction <2000 tokens (ROI negatif)
  • Moins de 5 requetes pendant TTL (breakeven non atteint)
  • Context change frequemment (invalidation cache trop souvent)
  • Implicit cache suffit (prefix >1024 tokens, requetes <5 min)

๐Ÿ› ๏ธ Cache Management Best Practices

python
from datetime import timedelta
from vertexai.preview import caching

# 1. Lister tous les caches actifs
caches = caching.CachedContent.list()
for cache in caches:
    print(f"Cache: {cache.name}")
    print(f"  Model: {cache.model}")
    print(f"  Expire: {cache.expire_time}")
    print(f"  Size: {len(cache.system_instruction)} chars")

# 2. Supprimer cache manuellement si context change
cache_to_delete = caching.CachedContent(cached_content_name="cache-123")
cache_to_delete.delete()
print("Cache deleted")

# 3. Mettre a jour TTL d'un cache existant
cache_to_update = caching.CachedContent(cached_content_name="cache-456")
cache_to_update.update(ttl=timedelta(minutes=120))  # Extend TTL
print("TTL updated")

# 4. Monitoring usage cache
from google.cloud import monitoring_v3
client = monitoring_v3.MetricServiceClient()

query = """
fetch aiplatform.googleapis.com/prediction/cache_hit_count
| group_by 1h, [value_cache_hit_count_mean: mean(value.cache_hit_count)]
| every 1h
"""

# 5. Alert si cache hit rate < 80% (probleme TTL ou invalidation)
# โ†’ Creer alerte Cloud Monitoring sur cache_hit_rate metric

Context caching est votre meilleur allie FinOps. Pour chatbot/RAG avec system instruction longue, ROI est positif des 2 requetes. Commencez avec TTL conservateur (30 min), puis ajustez base sur metriques. Implicit cache gratuit pour conversations courtes. Warming pour apps critiques latence.

Model Routing Intelligent

โฑ 30 min Avance

๐ŸŽฏ Objectifs d'apprentissage

  • Implementer classifier de requetes multi-niveau
  • Router Pro/Flash/Flash-8B intelligemment
  • Gerer fallback et error handling
  • Mesurer quality/cost tradeoff

๐ŸŽฏ Architecture Model Router

User Query
  └─→ Complexity Classifier (Flash-8B)
        ├─→ Simple   → Flash-8B (cout : $0.04/1M, ~70% du traffic)
        ├─→ Standard → Flash    (cout : $0.15/1M, ~25% du traffic)
        └─→ Complex  → Pro      (cout : $3.00/1M, ~5% du traffic)

๐Ÿง  Classifier Implementation

python
from vertexai.generative_models import GenerativeModel, GenerationConfig
import json
from enum import Enum

class ComplexityLevel(Enum):
    SIMPLE = "simple"
    STANDARD = "standard"
    COMPLEX = "complex"

class IntelligentRouter:
    def __init__(self):
        # Classifier ultra-rapide avec Flash-8B
        self.classifier = GenerativeModel("gemini-2.5-flash-8b")

        # 3 modeles production
        self.flash_8b = GenerativeModel("gemini-2.5-flash-8b")
        self.flash = GenerativeModel("gemini-2.5-flash")
        self.pro = GenerativeModel("gemini-2.5-pro")

    def classify_complexity(self, prompt: str) -> ComplexityLevel:
        """Classifier requete avec LLM (Flash-8B)"""

        classification_prompt = f"""
Analyse cette requete utilisateur et determine sa complexite :
- SIMPLE : FAQ, recherche info factuelle, classification basique
- STANDARD : Resume, extraction donnees, generation texte standard
- COMPLEX : Analyse multi-etapes, raisonnement logique, creative writing

Requete : "{prompt}"

Reponds UNIQUEMENT par JSON :
{{"complexity": "simple|standard|complex", "reasoning": "explication courte"}}
"""

        response = self.classifier.generate_content(
            classification_prompt,
            generation_config=GenerationConfig(
                response_mime_type="application/json",
                max_output_tokens=100,
                temperature=0.1,
            )
        )

        result = json.loads(response.text)
        complexity = ComplexityLevel(result["complexity"])

        print(f"[Classifier] {complexity.value.upper()}: {result['reasoning']}")
        return complexity

    def route_and_generate(self, prompt: str, temperature: float = 0.7):
        """Router et generer reponse"""

        # 1. Classifier (coute ~$0.000004)
        complexity = self.classify_complexity(prompt)

        # 2. Selectionner modele
        if complexity == ComplexityLevel.SIMPLE:
            model = self.flash_8b
            model_name = "Flash-8B"
        elif complexity == ComplexityLevel.STANDARD:
            model = self.flash
            model_name = "Flash"
        else:  # COMPLEX
            model = self.pro
            model_name = "Pro"

        print(f"[Router] โ†’ {model_name}")

        # 3. Generer avec fallback
        try:
            response = model.generate_content(
                prompt,
                generation_config=GenerationConfig(
                    temperature=temperature,
                    max_output_tokens=2048,
                )
            )
            return response.text, model_name

        except Exception as e:
            # Fallback vers modele superieur si echec
            print(f"[Router] Error with {model_name}, falling back to Pro")
            response = self.pro.generate_content(prompt)
            return response.text, "Pro (fallback)"

# Test router
router = IntelligentRouter()

# Requete simple โ†’ Flash-8B
response1, model1 = router.route_and_generate(
    "Quelle est la capitale de l'Italie ?"
)
print(f"Model: {model1}\nReponse: {response1}\n")

# Requete standard โ†’ Flash
response2, model2 = router.route_and_generate(
    "Resume les 3 principales caracteristiques du cloud computing"
)
print(f"Model: {model2}\nReponse: {response2}\n")

# Requete complexe โ†’ Pro
response3, model3 = router.route_and_generate(
    "Analyse les implications ethiques de l'IA dans le systeme judiciaire, "
    "en considerant les biais algorithmiques et la transparence des decisions"
)
print(f"Model: {model3}\nReponse: {response3}\n")

๐Ÿ“Š Quality/Cost Tradeoff Analysis

python
import time
from dataclasses import dataclass

@dataclass
class RoutingMetrics:
    model: str
    latency_ms: float
    cost_usd: float
    quality_score: float  # 0-100, evaluation humaine ou automatique

class RouterAnalyzer:
    def __init__(self):
        self.metrics = []

    def evaluate_routing(self,
                          test_queries: list,
                          router: IntelligentRouter):
        """Evaluer quality/cost tradeoff"""

        total_cost = 0
        total_latency = 0
        total_quality = 0

        for query in test_queries:
            start = time.time()
            response, model = router.route_and_generate(query)
            latency = (time.time() - start) * 1000

            # Estimer cout base sur tokens (simplifie)
            tokens_estimate = len(query.split()) * 1.3 + len(response.split()) * 1.3
            if "Flash-8B" in model:
                cost = tokens_estimate / 1_000_000 * 0.20  # Input + output
            elif "Flash" in model:
                cost = tokens_estimate / 1_000_000 * 0.75
            else:  # Pro
                cost = tokens_estimate / 1_000_000 * 15.00

            # Quality score (simuler evaluation - en prod, utiliser LLM judge)
            quality = self._evaluate_quality(query, response)

            self.metrics.append(RoutingMetrics(
                model=model,
                latency_ms=latency,
                cost_usd=cost,
                quality_score=quality
            ))

            total_cost += cost
            total_latency += latency
            total_quality += quality

        # Calculer moyennes
        n = len(test_queries)
        avg_cost = total_cost / n
        avg_latency = total_latency / n
        avg_quality = total_quality / n

        print("=== ROUTING ANALYSIS ===")
        print(f"Total queries: {n}")
        print(f"Avg cost/query: ${avg_cost:.6f}")
        print(f"Avg latency: {avg_latency:.0f}ms")
        print(f"Avg quality: {avg_quality:.1f}/100")
        print(f"\nTotal cost: ${total_cost:.4f}")

        # Distribution modeles
        model_counts = {}
        for m in self.metrics:
            model_counts[m.model] = model_counts.get(m.model, 0) + 1

        print("\n=== MODEL DISTRIBUTION ===")
        for model, count in sorted(model_counts.items()):
            pct = count / n * 100
            print(f"{model}: {count} ({pct:.1f}%)")

        return {
            "avg_cost": avg_cost,
            "avg_latency": avg_latency,
            "avg_quality": avg_quality,
            "model_distribution": model_counts
        }

    def _evaluate_quality(self, query: str, response: str) -> float:
        """Evaluer qualite reponse (simplifie)"""
        # En production : utiliser LLM judge ou human evaluation
        # Ici : heuristique simple
        if len(response) < 50:
            return 60.0
        elif "sorry" in response.lower() or "cannot" in response.lower():
            return 40.0
        else:
            return 85.0

# Test avec dataset
test_queries = [
    "Capitale du Japon ?",
    "Liste 3 langages de programmation",
    "Explique la photosynthese simplement",
    "Compare architecture REST vs GraphQL en detail",
    "Analyse critique de la blockchain pour supply chain avec exemples concrets",
]

analyzer = RouterAnalyzer()
results = analyzer.evaluate_routing(test_queries, router)

# SORTIE EXEMPLE :
# === ROUTING ANALYSIS ===
# Total queries: 5
# Avg cost/query: $0.000180
# Avg latency: 1250ms
# Avg quality: 82.0/100
#
# Total cost: $0.0009
#
# === MODEL DISTRIBUTION ===
# Flash: 2 (40.0%)
# Flash-8B: 2 (40.0%)
# Pro: 1 (20.0%)

๐Ÿ›ก๏ธ Fallback Strategy

python
from typing import Optional

class RobustRouter:
    def __init__(self):
        self.flash_8b = GenerativeModel("gemini-2.5-flash-8b")
        self.flash = GenerativeModel("gemini-2.5-flash")
        self.pro = GenerativeModel("gemini-2.5-pro")

    def generate_with_fallback(self,
                                prompt: str,
                                preferred_model: str = "flash") -> dict:
        """Generate avec fallback cascade"""

        # Definir cascade
        if preferred_model == "flash-8b":
            cascade = [self.flash_8b, self.flash, self.pro]
            cascade_names = ["Flash-8B", "Flash", "Pro"]
        elif preferred_model == "flash":
            cascade = [self.flash, self.pro]
            cascade_names = ["Flash", "Pro"]
        else:  # pro
            cascade = [self.pro]
            cascade_names = ["Pro"]

        # Essayer cascade
        last_error = None
        for model, name in zip(cascade, cascade_names):
            try:
                print(f"[Fallback] Trying {name}...")
                response = model.generate_content(
                    prompt,
                    generation_config=GenerationConfig(
                        max_output_tokens=2048,
                        temperature=0.7,
                    )
                )

                # Verifier qualite reponse
                if response.text and len(response.text) > 10:
                    print(f"[Fallback] โœ“ Success with {name}")
                    return {
                        "text": response.text,
                        "model": name,
                        "fallback": name != cascade_names[0]
                    }
                else:
                    raise ValueError("Response too short")

            except Exception as e:
                print(f"[Fallback] โœ— {name} failed: {e}")
                last_error = e
                continue

        # Tous les modeles ont echoue
        raise RuntimeError(f"All models failed. Last error: {last_error}")

# Test fallback
robust_router = RobustRouter()

# Requete normale : Flash suffit
result1 = robust_router.generate_with_fallback(
    "Explique REST API",
    preferred_model="flash"
)
print(f"Model: {result1['model']}, Fallback: {result1['fallback']}\n")

# Requete tres longue : Flash echoue โ†’ Pro fallback
# (simuler echec en ajoutant prompt trop long pour Flash)
long_prompt = "Analyse " + " ".join(["cette situation complexe"] * 10000)
try:
    result2 = robust_router.generate_with_fallback(
        long_prompt,
        preferred_model="flash"
    )
    print(f"Model: {result2['model']}, Fallback: {result2['fallback']}\n")
except Exception as e:
    print(f"Error: {e}")

๐Ÿ’ก Regles de Routing Optimales

| Use Case | Modele | Raison |
|---|---|---|
| FAQ / Support Tier 1 | Flash-8B | Reponses factuelles, latence critique, volume eleve |
| Classification / Tagging | Flash-8B | Sortie JSON, deterministe, rapide |
| Extraction entites | Flash | Precision > vitesse, output structure |
| Resume documents | Flash | Equilibre qualite/cout, context long |
| Code generation | Flash | Syntaxe correcte, output deterministe |
| Analyse complexe | Pro | Raisonnement multi-etapes, nuance |
| Creative writing | Pro | Creativite, style, coherence longue |
| Research synthesis | Pro | Comprehension profonde, cross-referencing |

Model routing intelligent peut reduire couts de 60-70% sans degrader qualite. Utilisez Flash-8B pour 70% requetes (FAQ, classification), Flash pour 25% (summaries, extraction), Pro pour 5% seulement (analyse complexe). Classifier coute $0.000004, ROI positif immediate. Fallback vers Pro = safety net si Flash echoue.
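
L'affirmation sur le cout du classifier se verifie en deux lignes. Esquisse (hypotheses : prompt de classification de ~100 tokens, requete moyenne de 1 000 tokens) :

```python
# Esquisse : overhead du classifier vs economie de routage (100 et 1000 tokens supposes)
PRIX = {"flash-8b": 0.04, "flash": 0.15, "pro": 3.00}  # $/1M tokens input

overhead = 100 / 1_000_000 * PRIX["flash-8b"]                   # cout du classifier par requete
economie = 1000 / 1_000_000 * (PRIX["pro"] - PRIX["flash-8b"])  # requete routee Pro -> Flash-8B

print(f"Overhead classifier : ${overhead:.6f}")   # $0.000004
print(f"Economie par requete: ${economie:.6f}")   # $0.002960
print(f"Ratio : ~x{economie / overhead:.0f}")
```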

Batch API & Traitement Asynchrone

โฑ 25 min Intermediaire

๐ŸŽฏ Objectifs d'apprentissage

  • Comprendre Batch API et -50% reduction cout
  • Implementer workflows JSONL batch
  • Utiliser SDK OpenAI compatible
  • Monitorer et gerer batch jobs

๐Ÿ“ฆ Batch API : -50% Cout pour Traitement Non-Urgent

๐Ÿ’ก Quand utiliser Batch API ?
  • Traitement asynchrone acceptable (10-30 minutes)
  • Volume eleve (>1000 requetes)
  • Use cases : ETL, data enrichment, bulk classification, offline evaluation
  • Economie : 50% vs real-time API
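
L'arbitrage se chiffre facilement. Esquisse avec le tarif input Flash suppose (la remise batch de 50% s'applique aussi a l'output, ignore ici) :

```python
# Esquisse : cout temps reel vs Batch API (tarif input Flash suppose, output ignore)
def cout_traitement(nb_requetes: int, tokens_par_req: int, batch: bool) -> float:
    prix_input = 0.15 / 1_000_000  # $/token, Gemini 2.5 Flash
    if batch:
        prix_input *= 0.5          # -50% en Batch API
    return nb_requetes * tokens_par_req * prix_input

cout_temps_reel = cout_traitement(10_000, 500, batch=False)
cout_batch = cout_traitement(10_000, 500, batch=True)
print(f"Temps reel : ${cout_temps_reel:.3f}")  # $0.750
print(f"Batch      : ${cout_batch:.3f}")       # $0.375
```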

๐Ÿ”„ Workflow Batch API

1. Prepare JSONL → 2. Upload GCS → 3. Submit Job → 4. Monitor → 5. Download Results

Temps total : 10-30 min | Economie : 50%

๐Ÿ“ Implementation Complete

python
import json
from google.cloud import storage, aiplatform
from datetime import datetime
import time

class GeminiBatchProcessor:
    def __init__(self,
                 project_id: str,
                 location: str,
                 bucket_name: str):
        self.project_id = project_id
        self.location = location
        self.bucket_name = bucket_name

        aiplatform.init(project=project_id, location=location)
        self.storage_client = storage.Client()

    def prepare_batch_jsonl(self,
                             prompts: list[str],
                             output_file: str = "batch_input.jsonl"):
        """Preparer fichier JSONL pour batch"""

        with open(output_file, "w") as f:
            for i, prompt in enumerate(prompts):
                request = {
                    "request": {
                        "contents": [
                            {
                                "role": "user",
                                "parts": [{"text": prompt}]
                            }
                        ]
                    }
                }
                f.write(json.dumps(request) + "\n")

        print(f"โœ“ Created {output_file} with {len(prompts)} requests")
        return output_file

    def upload_to_gcs(self, local_file: str, gcs_path: str):
        """Upload fichier vers GCS"""

        bucket = self.storage_client.bucket(self.bucket_name)
        blob = bucket.blob(gcs_path)
        blob.upload_from_filename(local_file)

        gcs_uri = f"gs://{self.bucket_name}/{gcs_path}"
        print(f"โœ“ Uploaded to {gcs_uri}")
        return gcs_uri

    def submit_batch_job(self,
                          input_uri: str,
                          output_uri_prefix: str,
                          model_name: str = "publishers/google/models/gemini-2.5-flash"):
        """Submit batch prediction job"""

        batch_job = aiplatform.BatchPredictionJob.create(
            job_display_name=f"gemini-batch-{datetime.now().strftime('%Y%m%d-%H%M%S')}",
            model_name=model_name,
            gcs_source=input_uri,
            gcs_destination_prefix=output_uri_prefix,
            instances_format="jsonl",
            predictions_format="jsonl",
        )

        print(f"โœ“ Batch job submitted: {batch_job.name}")
        print(f"  Status: {batch_job.state}")
        return batch_job

    def monitor_job(self, batch_job, poll_interval: int = 60):
        """Monitorer job jusqu'a completion"""

        print(f"Monitoring job {batch_job.display_name}...")

        from google.cloud.aiplatform_v1.types import JobState  # enum des etats de job

        terminal_states = {
            JobState.JOB_STATE_SUCCEEDED,
            JobState.JOB_STATE_FAILED,
            JobState.JOB_STATE_CANCELLED,
        }
        # La propriete .state re-synchronise la ressource a chaque lecture
        while batch_job.state not in terminal_states:
            time.sleep(poll_interval)
            print(f"  Status: {batch_job.state.name} ({datetime.now().strftime('%H:%M:%S')})")

        if batch_job.state == JobState.JOB_STATE_SUCCEEDED:
            print("✓ Job completed successfully!")
            return True
        else:
            print(f"✗ Job failed: {batch_job.error}")
            return False

    def download_results(self, output_uri_prefix: str, local_file: str = "batch_output.jsonl"):
        """Download resultats depuis GCS"""

        # Parse GCS URI
        parts = output_uri_prefix.replace("gs://", "").split("/")
        bucket_name = parts[0]
        prefix = "/".join(parts[1:])

        # Lister fichiers output
        bucket = self.storage_client.bucket(bucket_name)
        blobs = list(bucket.list_blobs(prefix=prefix))

        # Download tous les fichiers (batch peut splitter en plusieurs)
        results = []
        for blob in blobs:
            if blob.name.endswith(".jsonl"):
                content = blob.download_as_text()
                for line in content.strip().split("\n"):
                    results.append(json.loads(line))

        # Sauver local
        with open(local_file, "w") as f:
            for result in results:
                f.write(json.dumps(result) + "\n")

        print(f"โœ“ Downloaded {len(results)} results to {local_file}")
        return results

# EXAMPLE: batch classification of 10,000 customer feedback messages
processor = GeminiBatchProcessor(
    project_id="my-project",
    location="us-central1",
    bucket_name="my-batch-bucket"
)

# 1. Prepare the prompts
feedbacks = [
    "The product is excellent, fast delivery!",
    "Terrible customer service, 2h wait on the phone",
    # ... 9,998 more feedback messages
]

classification_prompts = [
    f"Classify this customer feedback as POSITIVE, NEGATIVE or NEUTRAL. "
    f"Feedback: \"{fb}\"\nAnswer (one word only):"
    for fb in feedbacks
]

# 2. Create the JSONL
input_file = processor.prepare_batch_jsonl(classification_prompts)

# 3. Upload to GCS
input_uri = processor.upload_to_gcs(
    input_file,
    "batch_jobs/classification_input.jsonl"
)

# 4. Submit the job
output_uri = f"gs://{processor.bucket_name}/batch_jobs/output/"
batch_job = processor.submit_batch_job(
    input_uri=input_uri,
    output_uri_prefix=output_uri,
    model_name="gemini-2.5-flash"  # -50% vs real-time
)

# 5. Monitor (blocking)
success = processor.monitor_job(batch_job, poll_interval=60)

# 6. Download the results
if success:
    results = processor.download_results(output_uri)

    # Parse the results
    classifications = []
    for result in results:
        text = result["response"]["candidates"][0]["content"]["parts"][0]["text"]
        classifications.append(text.strip().upper())

    # Stats
    from collections import Counter
    counts = Counter(classifications)
    print("\n=== RESULTS ===")
    print(f"Positive: {counts['POSITIVE']}")
    print(f"Negative: {counts['NEGATIVE']}")
    print(f"Neutral: {counts['NEUTRAL']}")

# SAVINGS:
# Real-time : 10,000 req ร— ~100 input tokens โ‰ˆ 1M input tokens ร— $0.15/1M = $0.15
#             plus ~1M output tokens ร— $0.60/1M = $0.60  โ†’  total $0.75
# Batch API : $0.75 ร— 0.5 = $0.375
# Savings   : $0.375 (50%)
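The savings arithmetic above generalizes into a small helper. The prices below assume Gemini 2.5 Flash list rates and the 50% batch discount quoted in this section; adjust for your model.

```python
def estimate_batch_savings(
    num_requests: int,
    avg_input_tokens: int,
    avg_output_tokens: int,
    input_price_per_m: float = 0.15,   # assumed Flash input $/1M tokens
    output_price_per_m: float = 0.60,  # assumed Flash output $/1M tokens
    batch_discount: float = 0.5,       # Batch API: 50% of real-time price
) -> dict:
    """Estimate real-time vs batch cost for a workload."""
    input_cost = num_requests * avg_input_tokens / 1e6 * input_price_per_m
    output_cost = num_requests * avg_output_tokens / 1e6 * output_price_per_m
    realtime = input_cost + output_cost
    batch = realtime * batch_discount
    return {"realtime": realtime, "batch": batch, "savings": realtime - batch}

# 10,000 requests, ~100 input / ~100 output tokens each
print(estimate_batch_savings(10_000, 100, 100))
# โ†’ {'realtime': 0.75, 'batch': 0.375, 'savings': 0.375}
```

This reproduces the $0.75 vs $0.375 figures above and lets you plug in your own volumes before committing to a batch pipeline.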

๐Ÿ”ง OpenAI-Compatible SDK

The Vertex AI Batch API is compatible with the OpenAI SDK, which eases migration.

python
# Installation
# pip install google-cloud-aiplatform openai

from openai import OpenAI
import json
import os
import time

# Configure the OpenAI client against the Vertex AI endpoint
client = OpenAI(
    base_url=f"https://{LOCATION}-aiplatform.googleapis.com/v1/projects/{PROJECT_ID}/locations/{LOCATION}/endpoints/openai",
    api_key=os.environ.get("GOOGLE_API_KEY")  # Use ADC in production
)

# Create the batch file (OpenAI format)
with open("batch_openai.jsonl", "w") as f:
    for i, prompt in enumerate(prompts):
        request = {
            "custom_id": f"request-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gemini-2.5-flash",
                "messages": [{"role": "user", "content": prompt}],
                "max_tokens": 200
            }
        }
        f.write(json.dumps(request) + "\n")

# Upload batch file
batch_input_file = client.files.create(
    file=open("batch_openai.jsonl", "rb"),
    purpose="batch"
)

# Create batch job
batch = client.batches.create(
    input_file_id=batch_input_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h"
)

print(f"Batch ID: {batch.id}")
print(f"Status: {batch.status}")

# Poll status
while batch.status not in ["completed", "failed", "cancelled"]:
    time.sleep(60)
    batch = client.batches.retrieve(batch.id)
    print(f"Status: {batch.status}")

# Download results
if batch.status == "completed":
    result_file = client.files.content(batch.output_file_id)
    with open("batch_results.jsonl", "wb") as f:
        f.write(result_file.read())

๐Ÿ“Š Monitoring Batch Jobs

python
from google.cloud import aiplatform

# List all batch jobs
batch_jobs = aiplatform.BatchPredictionJob.list(
    filter='display_name:gemini-batch-*',
    order_by='create_time desc'
)

print("=== BATCH JOBS ===")
for job in batch_jobs:
    print(f"Name: {job.display_name}")
    print(f"  State: {job.state}")
    print(f"  Created: {job.create_time}")
    print(f"  Model: {job.model_name}")

    if job.state == "SUCCEEDED":
        # Compute metrics
        elapsed = (job.end_time - job.create_time).total_seconds()
        print(f"  Duration: {elapsed/60:.1f} minutes")

# Build a Cloud Monitoring dashboard
from google.cloud import monitoring_v3

query = """
fetch aiplatform.googleapis.com/prediction/batch_prediction_job/count
| filter resource.job_id =~ 'gemini-batch-.*'
| group_by [resource.state], 1d
| every 1d
"""

# Alerts on batch job failures
# gcloud alpha monitoring policies create \
#   --notification-channels=CHANNEL_ID \
#   --display-name="Batch Job Failures" \
#   --condition-threshold-value=1 \
#   --condition-threshold-duration=300s
โš ๏ธ Limitations Batch API
  • Latence 10-30 minutes (non-realtime)
  • Pas de streaming
  • Pas de function calling (en beta)
  • Limite 50,000 requetes par job

Batch API offre 50% reduction cout pour workloads non-urgents. Utilisez pour ETL overnight, bulk classification, offline evaluation. Temps processing : 10-30 min. Si vous avez 100K+ requetes/jour et latence non-critique, economie annuelle peut atteindre $10-50K. Setup initial 1-2h, ROI immediate.

Cost Monitoring

โฑ 25 min Intermediate

๐ŸŽฏ Learning objectives

  • Configure Cloud Billing for AI cost tracking
  • Create budget alerts and thresholds
  • Build real-time cost dashboards
  • Implement per-project/per-team attribution

๐Ÿ’ฐ Cost Monitoring Architecture

Vertex AI Usage โ†’ Cloud Billing โ†’ BigQuery Export โ†’ Budget Alerts โ†’ Cost Dashboard โ†’ FinOps Actions

๐Ÿ“Š Exporting Billing to BigQuery

bash
# 1. Create a BigQuery dataset for billing
bq mk --dataset --location=US my_project:billing_export

# 2. Enable Cloud Billing export (via the Console)
# Console : Billing โ†’ Billing export โ†’ BigQuery export โ†’ Enable

# 3. Verify the export is active
bq ls my_project:billing_export
# โ†’ gcp_billing_export_v1_XXXXXX_XXXXXX_XXXXXX

# 4. Query Vertex AI costs
bq query --use_legacy_sql=false '
SELECT
  service.description AS service,
  sku.description AS sku,
  SUM(cost) AS total_cost,
  SUM(usage.amount) AS usage_amount,
  usage.unit AS unit
FROM `my_project.billing_export.gcp_billing_export_v1_*`
WHERE service.description = "Vertex AI"
  AND _TABLE_SUFFIX BETWEEN "20260201" AND "20260210"
GROUP BY service, sku, unit
ORDER BY total_cost DESC
LIMIT 20
'

# SAMPLE OUTPUT:
# โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
# โ”‚   service    โ”‚                sku                 โ”‚ total_cost โ”‚ usage_amount โ”‚  unit   โ”‚
# โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
# โ”‚ Vertex AI    โ”‚ Gemini 2.5 Flash Input Tokens      โ”‚     125.50 โ”‚   836666666  โ”‚ tokens  โ”‚
# โ”‚ Vertex AI    โ”‚ Gemini 2.5 Flash Output Tokens     โ”‚      85.30 โ”‚   142166666  โ”‚ tokens  โ”‚
# โ”‚ Vertex AI    โ”‚ Gemini 2.5 Pro Input Tokens        โ”‚      45.20 โ”‚    15066666  โ”‚ tokens  โ”‚
# โ”‚ Vertex AI    โ”‚ Context Caching Storage            โ”‚       2.10 โ”‚   140000000  โ”‚ tokens  โ”‚
# โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
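As a sanity check, SKU costs can be recomputed directly from the usage amounts; the $/1M-token rates below are assumptions matching the example rows above.

```python
def token_cost(usage_tokens: int, price_per_million: float) -> float:
    """Cost in USD for a token count at a given $/1M-token rate."""
    return usage_tokens / 1_000_000 * price_per_million

# Cross-check the Flash input row above: ~836.7M tokens at an assumed $0.15/1M
print(round(token_cost(836_666_666, 0.15), 2))  # โ†’ 125.5
```

Running this kind of check against the billing export quickly surfaces SKUs billed at a rate you did not expect.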

๐Ÿšจ Budget Alerts

python
from google.cloud import billing_budgets_v1

def create_ai_budget_alert(
    billing_account_id: str,
    project_id: str,
    budget_amount: float,
    alert_thresholds: list = [0.5, 0.9, 1.0]
):
    """Creer budget alert pour Vertex AI"""

    client = billing_budgets_v1.BudgetServiceClient()

    # Configure the budget
    budget = billing_budgets_v1.Budget()
    budget.display_name = f"Vertex AI Budget - {project_id}"
    budget.budget_filter = billing_budgets_v1.Filter(
        projects=[f"projects/{project_id}"],
        services=["services/aiplatform.googleapis.com"],  # Vertex AI
    )

    # Monthly amount
    budget.amount = billing_budgets_v1.BudgetAmount(
        specified_amount={"currency_code": "USD", "units": int(budget_amount)}
    )

    # Alert thresholds
    budget.threshold_rules = [
        billing_budgets_v1.ThresholdRule(
            threshold_percent=threshold,
            spend_basis=billing_budgets_v1.ThresholdRule.Basis.CURRENT_SPEND,
        )
        for threshold in alert_thresholds
    ]

    # Create the budget
    parent = f"billingAccounts/{billing_account_id}"
    response = client.create_budget(parent=parent, budget=budget)

    print(f"โœ“ Budget created: {response.name}")
    print(f"  Amount: ${budget_amount}/month")
    print(f"  Alerts at: {', '.join([f'{int(t*100)}%' for t in alert_thresholds])}")

    return response

# Create a $1000/month budget with alerts at 50%, 90%, 100%
budget = create_ai_budget_alert(
    billing_account_id="012345-6789AB-CDEF01",
    project_id="my-ai-project",
    budget_amount=1000.0,
    alert_thresholds=[0.5, 0.9, 1.0]
)

# Configure email/Pub/Sub notifications
# Via the Console : Billing โ†’ Budgets & alerts โ†’ Select budget โ†’ Manage notifications

๐Ÿ“ˆ Real-Time Cost Dashboard

python
from google.cloud import bigquery
import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime, timedelta

class VertexAICostDashboard:
    def __init__(self, project_id: str, billing_dataset: str):
        self.client = bigquery.Client(project=project_id)
        self.billing_table = f"`{project_id}.{billing_dataset}.gcp_billing_export_v1_*`"

    def get_daily_costs(self, days: int = 30) -> pd.DataFrame:
        """Couts quotidiens Vertex AI"""

        start_date = (datetime.now() - timedelta(days=days)).strftime("%Y%m%d")
        end_date = datetime.now().strftime("%Y%m%d")

        query = f"""
        SELECT
          DATE(usage_start_time) AS date,
          SUM(cost) AS daily_cost
        FROM {self.billing_table}
        WHERE service.description = 'Vertex AI'
          AND _TABLE_SUFFIX BETWEEN '{start_date}' AND '{end_date}'
        GROUP BY date
        ORDER BY date
        """

        return self.client.query(query).to_dataframe()

    def get_costs_by_model(self, days: int = 7) -> pd.DataFrame:
        """Couts par modele Gemini"""

        start_date = (datetime.now() - timedelta(days=days)).strftime("%Y%m%d")
        end_date = datetime.now().strftime("%Y%m%d")

        query = f"""
        SELECT
          CASE
            WHEN sku.description LIKE '%2.5 Pro%' THEN 'Gemini 2.5 Pro'
            WHEN sku.description LIKE '%2.5 Flash-8B%' THEN 'Gemini 2.5 Flash-8B'
            WHEN sku.description LIKE '%2.5 Flash%' THEN 'Gemini 2.5 Flash'
            ELSE 'Other'
          END AS model,
          SUM(cost) AS cost,
          SUM(usage.amount) AS tokens
        FROM {self.billing_table}
        WHERE service.description = 'Vertex AI'
          AND _TABLE_SUFFIX BETWEEN '{start_date}' AND '{end_date}'
        GROUP BY model
        ORDER BY cost DESC
        """

        return self.client.query(query).to_dataframe()

    def get_costs_by_label(self, label_key: str, days: int = 7) -> pd.DataFrame:
        """Couts par label (team, project, env)"""

        start_date = (datetime.now() - timedelta(days=days)).strftime("%Y%m%d")
        end_date = datetime.now().strftime("%Y%m%d")

        query = f"""
        SELECT
          labels.value AS {label_key},
          SUM(cost) AS cost
        FROM {self.billing_table},
        UNNEST(labels) AS labels
        WHERE service.description = 'Vertex AI'
          AND _TABLE_SUFFIX BETWEEN '{start_date}' AND '{end_date}'
          AND labels.key = '{label_key}'
        GROUP BY {label_key}
        ORDER BY cost DESC
        """

        return self.client.query(query).to_dataframe()

    def plot_dashboard(self):
        """Generer dashboard visuel"""

        fig, axes = plt.subplots(2, 2, figsize=(15, 10))

        # 1. Daily costs trend
        daily = self.get_daily_costs(days=30)
        axes[0, 0].plot(daily['date'], daily['daily_cost'], marker='o')
        axes[0, 0].set_title('Daily Costs (30 days)')
        axes[0, 0].set_xlabel('Date')
        axes[0, 0].set_ylabel('Cost ($)')
        axes[0, 0].grid(True)

        # 2. Costs by model (pie chart)
        models = self.get_costs_by_model(days=7)
        axes[0, 1].pie(models['cost'], labels=models['model'], autopct='%1.1f%%')
        axes[0, 1].set_title('Costs by Model (7 days)')

        # 3. Costs by team
        teams = self.get_costs_by_label('team', days=7)
        axes[1, 0].bar(teams['team'], teams['cost'])
        axes[1, 0].set_title('Costs by Team (7 days)')
        axes[1, 0].set_xlabel('Team')
        axes[1, 0].set_ylabel('Cost ($)')
        axes[1, 0].tick_params(axis='x', rotation=45)

        # 4. Summary stats
        total_cost = daily['daily_cost'].sum()
        avg_daily = daily['daily_cost'].mean()
        forecast_monthly = avg_daily * 30

        summary_text = f"""
        === COST SUMMARY ===

        Last 30 days: ${total_cost:.2f}
        Avg daily: ${avg_daily:.2f}
        Forecast monthly: ${forecast_monthly:.2f}

        Top model: {models.iloc[0]['model']}
        Top model cost: ${models.iloc[0]['cost']:.2f}
        """
        axes[1, 1].text(0.1, 0.5, summary_text, fontsize=12, family='monospace')
        axes[1, 1].axis('off')

        plt.tight_layout()
        plt.savefig('vertex_ai_cost_dashboard.png', dpi=150)
        print("โœ“ Dashboard saved to vertex_ai_cost_dashboard.png")

# Generate the dashboard
dashboard = VertexAICostDashboard(
    project_id="my-project",
    billing_dataset="billing_export"
)

dashboard.plot_dashboard()

# For a real-time dashboard: deploy on Cloud Run and schedule an hourly refresh

๐Ÿท๏ธ Cost Attribution avec Labels

python
from vertexai.generative_models import GenerativeModel

# Label requests by team/project/environment
def generate_with_labels(prompt: str, labels: dict):
    """Generate with labels for cost tracking"""

    # Label format: key=value
    # Examples: team=data-science, project=chatbot, env=prod

    model = GenerativeModel(
        "gemini-2.5-flash",
        # Labels are attached to every request
        labels=labels
    )

    response = model.generate_content(prompt)
    return response.text

# Usage: trace costs per team
response1 = generate_with_labels(
    "Summarize this document",
    labels={
        "team": "marketing",
        "project": "content-generation",
        "env": "prod"
    }
)

response2 = generate_with_labels(
    "Analyse ces donnees",
    labels={
        "team": "data-science",
        "project": "analytics",
        "env": "dev"
    }
)

# Query costs per team
# SELECT labels.value AS team, SUM(cost) AS cost
# FROM billing_table, UNNEST(labels) AS labels
# WHERE labels.key = 'team'
# GROUP BY team
# โ†’ Marketing: $450, Data Science: $780

# Chargeback: bill internal teams based on actual usage
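The chargeback idea above can be sketched as a simple allocation: each team pays its direct labeled cost plus a pro-rata share of untagged spend. The function name and figures are illustrative.

```python
def chargeback(team_costs: dict, untagged_cost: float = 0.0) -> dict:
    """Build internal invoices: direct labeled cost per team, plus the
    untagged remainder allocated pro-rata to each team's share."""
    total = sum(team_costs.values())
    invoices = {}
    for team, cost in team_costs.items():
        share = cost / total if total else 0.0
        invoices[team] = round(cost + untagged_cost * share, 2)
    return invoices

# $123 of unlabeled spend redistributed pro-rata over the two teams
print(chargeback({"marketing": 450.0, "data-science": 780.0}, untagged_cost=123.0))
# โ†’ {'marketing': 495.0, 'data-science': 858.0}
```

Pro-rata allocation of unlabeled spend keeps the invoices summing to the real bill while still rewarding teams that label their traffic.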

๐ŸŽฏ Cost Optimization Recommendations

python
def analyze_cost_optimization_opportunities(billing_df: pd.DataFrame) -> dict:
    """Analyze cost optimization opportunities"""

    recommendations = []

    # 1. Detect Pro usage for simple requests
    pro_usage = billing_df[billing_df['sku'].str.contains('2.5 Pro')]
    if not pro_usage.empty:
        pro_cost = pro_usage['cost'].sum()
        potential_savings = pro_cost * 0.95  # ~95% if migrated to Flash
        recommendations.append({
            "type": "Model Downgrade",
            "current_cost": pro_cost,
            "potential_savings": potential_savings,
            "action": "Implement model routing: Flash for 80% of requests"
        })

    # 2. Detect missing caching
    cache_usage = billing_df[billing_df['sku'].str.contains('Cache')]
    if cache_usage.empty:
        input_cost = billing_df[billing_df['sku'].str.contains('Input')]['cost'].sum()
        potential_savings = input_cost * 0.5  # 50% with caching
        recommendations.append({
            "type": "Context Caching",
            "current_cost": input_cost,
            "potential_savings": potential_savings,
            "action": "Enable context caching for system instructions"
        })

    # 3. Detect a high input/output ratio (long prompts)
    input_cost = billing_df[billing_df['sku'].str.contains('Input')]['cost'].sum()
    output_cost = billing_df[billing_df['sku'].str.contains('Output')]['cost'].sum()
    ratio = input_cost / output_cost if output_cost > 0 else 0

    if ratio > 3:
        potential_savings = input_cost * 0.3  # 30% with prompt compression
        recommendations.append({
            "type": "Prompt Compression",
            "current_cost": input_cost,
            "potential_savings": potential_savings,
            "action": "Optimize prompts: remove redundancy, get straight to the point"
        })

    # 4. Compute total potential savings
    total_current = billing_df['cost'].sum()
    total_savings = sum([r['potential_savings'] for r in recommendations])
    savings_pct = (total_savings / total_current * 100) if total_current > 0 else 0

    return {
        "current_monthly_cost": total_current,
        "potential_monthly_savings": total_savings,
        "savings_percentage": savings_pct,
        "recommendations": recommendations
    }

# Example
recommendations = analyze_cost_optimization_opportunities(billing_df)
print(f"Current monthly cost: ${recommendations['current_monthly_cost']:.2f}")
print(f"Potential savings: ${recommendations['potential_monthly_savings']:.2f} ({recommendations['savings_percentage']:.1f}%)")
print("\nRecommendations:")
for i, rec in enumerate(recommendations['recommendations'], 1):
    print(f"{i}. {rec['type']}: Save ${rec['potential_savings']:.2f}/month")
    print(f"   Action: {rec['action']}")

Proactive cost monitoring is the key to FinOps. Export billing to BigQuery (free), build Looker Studio dashboards, and configure alerts at 50%/90%/100% of budget. Use labels for per-team attribution (chargeback). Review the dashboard weekly, identify anomalies, and optimize. With monitoring in place, you catch cost drift before the surprise invoice.

Lab: Complete FinOps Dashboard

โฑ 90 min Hands-on Lab

๐ŸŽฏ Lab Objective

Build a production-ready FinOps dashboard with:

  • Real-time cost tracking per model/team
  • Trend analysis and forecasting
  • Automatic anomaly alerts
  • Optimization recommendations

Step 1: Set Up BigQuery Export (10 min)

bash
# 1. Create the billing dataset
bq mk --dataset --location=US --description="Billing export for FinOps" \
  finops_lab:billing_data

# 2. Enable the export (via the Console)
# Billing โ†’ Billing export โ†’ BigQuery export โ†’ Enable
# Dataset: finops_lab:billing_data

# 3. Verify the export is active (wait 5-10 min)
bq ls finops_lab:billing_data
# โ†’ gcp_billing_export_v1_XXXXXX

# 4. Test a query
bq query --use_legacy_sql=false '
SELECT service.description, SUM(cost) as cost
FROM `finops_lab.billing_data.gcp_billing_export_v1_*`
WHERE _TABLE_SUFFIX >= FORMAT_DATE("%Y%m%d", DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY))
GROUP BY service.description
ORDER BY cost DESC
LIMIT 10
'

Step 2: Create BigQuery Views (15 min)

sql
-- View 1 : Daily Vertex AI costs
CREATE OR REPLACE VIEW `finops_lab.billing_data.vertex_ai_daily_costs` AS
SELECT
  DATE(usage_start_time) AS date,
  CASE
    WHEN sku.description LIKE '%2.5 Pro%' THEN 'Gemini 2.5 Pro'
    WHEN sku.description LIKE '%2.5 Flash-8B%' THEN 'Gemini 2.5 Flash-8B'
    WHEN sku.description LIKE '%2.5 Flash%' THEN 'Gemini 2.5 Flash'
    WHEN sku.description LIKE '%Cache%' THEN 'Context Caching'
    ELSE 'Other'
  END AS model,
  SUM(cost) AS cost,
  SUM(usage.amount) AS usage_amount,
  usage.unit
FROM `finops_lab.billing_data.gcp_billing_export_v1_*`
WHERE service.description = 'Vertex AI'
  AND _TABLE_SUFFIX >= FORMAT_DATE("%Y%m%d", DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY))
GROUP BY date, model, usage.unit;

-- View 2 : Costs by team (from labels)
CREATE OR REPLACE VIEW `finops_lab.billing_data.vertex_ai_costs_by_team` AS
SELECT
  DATE(usage_start_time) AS date,
  labels.value AS team,
  SUM(cost) AS cost
FROM `finops_lab.billing_data.gcp_billing_export_v1_*`,
UNNEST(labels) AS labels
WHERE service.description = 'Vertex AI'
  AND _TABLE_SUFFIX >= FORMAT_DATE("%Y%m%d", DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY))
  AND labels.key = 'team'
GROUP BY date, team;

-- View 3 : Anomaly detection (cost spike >50% vs avg)
CREATE OR REPLACE VIEW `finops_lab.billing_data.vertex_ai_cost_anomalies` AS
WITH daily_costs AS (
  SELECT
    DATE(usage_start_time) AS date,
    SUM(cost) AS daily_cost
  FROM `finops_lab.billing_data.gcp_billing_export_v1_*`
  WHERE service.description = 'Vertex AI'
    AND _TABLE_SUFFIX >= FORMAT_DATE("%Y%m%d", DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY))
  GROUP BY date
),
stats AS (
  SELECT
    AVG(daily_cost) AS avg_cost,
    STDDEV(daily_cost) AS stddev_cost
  FROM daily_costs
)
SELECT
  dc.date,
  dc.daily_cost,
  s.avg_cost,
  dc.daily_cost - s.avg_cost AS deviation,
  (dc.daily_cost - s.avg_cost) / s.avg_cost * 100 AS deviation_pct
FROM daily_costs dc, stats s
WHERE dc.daily_cost > s.avg_cost * 1.5  -- Spike >50%
ORDER BY dc.date DESC;
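The same spike rule as the `vertex_ai_cost_anomalies` view (daily cost more than 50% above the period average) can be validated locally in pandas before wiring up alerts:

```python
import pandas as pd

def detect_cost_spikes(daily: pd.DataFrame, factor: float = 1.5) -> pd.DataFrame:
    """Flag days whose cost exceeds `factor` x the period average
    (same rule as the vertex_ai_cost_anomalies view)."""
    avg = daily["daily_cost"].mean()
    out = daily[daily["daily_cost"] > avg * factor].copy()
    out["deviation_pct"] = (out["daily_cost"] - avg) / avg * 100
    return out

# Two normal days and one spike (values are illustrative)
df = pd.DataFrame({"date": ["2026-02-01", "2026-02-02", "2026-02-03"],
                   "daily_cost": [10.0, 11.0, 30.0]})
print(detect_cost_spikes(df))
```

Prototyping the rule in pandas makes it cheap to tune `factor` on your own history before freezing it into the SQL view.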

Step 3: Python Dashboard (30 min)

python
# finops_dashboard.py
from google.cloud import bigquery
import pandas as pd
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from datetime import datetime, timedelta
import smtplib
from email.mime.text import MIMEText

class FinOpsDashboard:
    def __init__(self, project_id: str):
        self.client = bigquery.Client(project=project_id)
        self.project_id = project_id

    def fetch_daily_costs(self) -> pd.DataFrame:
        query = "SELECT * FROM `finops_lab.billing_data.vertex_ai_daily_costs`"
        return self.client.query(query).to_dataframe()

    def fetch_team_costs(self) -> pd.DataFrame:
        query = "SELECT * FROM `finops_lab.billing_data.vertex_ai_costs_by_team`"
        return self.client.query(query).to_dataframe()

    def fetch_anomalies(self) -> pd.DataFrame:
        query = "SELECT * FROM `finops_lab.billing_data.vertex_ai_cost_anomalies`"
        return self.client.query(query).to_dataframe()

    def generate_dashboard(self, output_file: str = "finops_dashboard.html"):
        """Generate interactive HTML dashboard"""

        # Fetch data
        daily_costs = self.fetch_daily_costs()
        team_costs = self.fetch_team_costs()
        anomalies = self.fetch_anomalies()

        # Create subplots
        fig = make_subplots(
            rows=3, cols=2,
            subplot_titles=(
                'Daily Costs by Model',
                'Model Distribution (Last 7 days)',
                'Costs by Team',
                'Cost Trend & Forecast',
                'Anomalies Detected',
                'Summary Metrics'
            ),
            specs=[
                [{"type": "scatter"}, {"type": "pie"}],
                [{"type": "bar"}, {"type": "scatter"}],
                [{"type": "scatter"}, {"type": "table"}]
            ]
        )

        # 1. Daily costs by model (line chart)
        for model in daily_costs['model'].unique():
            model_data = daily_costs[daily_costs['model'] == model]
            fig.add_trace(
                go.Scatter(
                    x=model_data['date'],
                    y=model_data['cost'],
                    name=model,
                    mode='lines+markers'
                ),
                row=1, col=1
            )

        # 2. Model distribution (pie chart - last 7 days)
        last_7d = daily_costs[daily_costs['date'] >= datetime.now() - timedelta(days=7)]
        model_costs = last_7d.groupby('model')['cost'].sum()
        fig.add_trace(
            go.Pie(labels=model_costs.index, values=model_costs.values),
            row=1, col=2
        )

        # 3. Costs by team (bar chart)
        team_total = team_costs.groupby('team')['cost'].sum().sort_values(ascending=False)
        fig.add_trace(
            go.Bar(x=team_total.index, y=team_total.values),
            row=2, col=1
        )

        # 4. Trend with forecast
        total_daily = daily_costs.groupby('date')['cost'].sum().reset_index()
        # Simple linear forecast
        last_7_avg = total_daily.tail(7)['cost'].mean()
        forecast_dates = pd.date_range(
            start=total_daily['date'].max() + timedelta(days=1),
            periods=30
        )
        forecast_values = [last_7_avg] * 30

        fig.add_trace(
            go.Scatter(
                x=total_daily['date'],
                y=total_daily['cost'],
                name='Actual',
                mode='lines'
            ),
            row=2, col=2
        )
        fig.add_trace(
            go.Scatter(
                x=forecast_dates,
                y=forecast_values,
                name='Forecast',
                mode='lines',
                line=dict(dash='dash')
            ),
            row=2, col=2
        )

        # 5. Anomalies
        if not anomalies.empty:
            fig.add_trace(
                go.Scatter(
                    x=anomalies['date'],
                    y=anomalies['daily_cost'],
                    mode='markers',
                    marker=dict(size=10, color='red'),
                    name='Anomalies'
                ),
                row=3, col=1
            )

        # 6. Summary table
        total_cost_30d = daily_costs['cost'].sum()
        avg_daily = daily_costs.groupby('date')['cost'].sum().mean()
        forecast_monthly = avg_daily * 30
        top_model = model_costs.idxmax()

        summary = pd.DataFrame({
            'Metric': [
                'Last 30d Cost',
                'Avg Daily Cost',
                'Forecast Monthly',
                'Top Model',
                'Anomalies Detected'
            ],
            'Value': [
                f"${total_cost_30d:.2f}",
                f"${avg_daily:.2f}",
                f"${forecast_monthly:.2f}",
                top_model,
                str(len(anomalies))
            ]
        })

        fig.add_trace(
            go.Table(
                header=dict(values=list(summary.columns)),
                cells=dict(values=[summary['Metric'], summary['Value']])
            ),
            row=3, col=2
        )

        # Layout
        fig.update_layout(
            height=1200,
            title_text="Vertex AI FinOps Dashboard",
            showlegend=True
        )

        # Save
        fig.write_html(output_file)
        print(f"โœ“ Dashboard saved to {output_file}")

        return fig, summary

    def check_and_alert_anomalies(self, email_to: str = None):
        """Check for anomalies and send alerts"""

        anomalies = self.fetch_anomalies()

        if not anomalies.empty:
            print(f"โš ๏ธ  {len(anomalies)} cost anomalies detected!")

            for _, row in anomalies.iterrows():
                print(f"  {row['date']}: ${row['daily_cost']:.2f} "
                      f"(+{row['deviation_pct']:.1f}% vs avg)")

            # Send email alert
            if email_to:
                self._send_email_alert(anomalies, email_to)
        else:
            print("โœ“ No cost anomalies detected")

    def _send_email_alert(self, anomalies: pd.DataFrame, email_to: str):
        """Send email alert for anomalies"""

        body = f"""
        Cost Anomaly Alert - Vertex AI

        {len(anomalies)} anomalies detected:

        """

        for _, row in anomalies.iterrows():
            body += f"- {row['date']}: ${row['daily_cost']:.2f} (+{row['deviation_pct']:.1f}%)\n"

        body += "\nCheck dashboard for details."

        msg = MIMEText(body)
        msg['Subject'] = f'โš ๏ธ  Vertex AI Cost Anomaly Alert'
        msg['From'] = 'finops@company.com'
        msg['To'] = email_to

        # Send via SMTP (configure your SMTP server)
        # smtp = smtplib.SMTP('smtp.gmail.com', 587)
        # smtp.send_message(msg)

        print(f"โœ“ Alert email sent to {email_to}")

# Generate dashboard
dashboard = FinOpsDashboard(project_id="finops_lab")
fig, summary = dashboard.generate_dashboard()

# Check anomalies
dashboard.check_and_alert_anomalies(email_to="team@company.com")

print("\n=== SUMMARY ===")
print(summary.to_string(index=False))

Step 4: Budget Alerts (10 min)

bash
# Create a $2000/month budget with alerts
# (replace BILLING_ACCOUNT_ID)

gcloud billing budgets create \
  --billing-account=BILLING_ACCOUNT_ID \
  --display-name="Vertex AI Monthly Budget" \
  --budget-amount=2000USD \
  --threshold-rule=percent=0.5 \
  --threshold-rule=percent=0.9 \
  --threshold-rule=percent=1.0 \
  --filter-projects=projects/finops_lab \
  --filter-services=services/aiplatform.googleapis.com

# Configure Pub/Sub notifications
gcloud pubsub topics create budget-alerts

gcloud pubsub subscriptions create budget-alerts-sub \
  --topic=budget-alerts \
  --push-endpoint=https://your-cloud-run-url/budget-alert

# Cloud Function to process the alerts
# (deploy a function that parses the message and sends email/Slack)
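A minimal sketch of such a handler, assuming the documented budget notification payload (base64-encoded JSON carrying `costAmount`, `budgetAmount`, and `budgetDisplayName` in the Pub/Sub message); the Slack/email sink is left as a comment since it depends on your stack:

```python
import base64
import json

def handle_budget_alert(envelope: dict) -> str:
    """Parse a Pub/Sub push envelope carrying a budget notification.

    Budget notifications arrive as base64-encoded JSON including, among
    other fields, costAmount, budgetAmount and budgetDisplayName.
    """
    payload = json.loads(base64.b64decode(envelope["message"]["data"]))
    cost = payload["costAmount"]
    budget = payload["budgetAmount"]
    pct = cost / budget * 100 if budget else 0.0
    msg = f"{payload['budgetDisplayName']}: ${cost:.2f} spent ({pct:.0f}% of ${budget:.2f})"
    # In a real deployment, post `msg` to Slack or send it by email here.
    return msg
```

Keeping the parsing separate from the notification sink makes the handler easy to unit-test without any cloud infrastructure.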

Step 5: Automated Scheduling (15 min)

bash
# Deploy the dashboard on Cloud Run with an hourly refresh
# Minimal Dockerfile and requirements (example only — adapt to your project)
cat > Dockerfile <<'EOF'
FROM python:3.12-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY finops_dashboard.py .
CMD ["python", "finops_dashboard.py"]
EOF

cat > requirements.txt <<'EOF'
google-cloud-bigquery
pandas
plotly
EOF

# Trigger hourly with Cloud Scheduler, e.g.:
# gcloud scheduler jobs create http finops-refresh \
#   --schedule="0 * * * *" --uri=https://YOUR-CLOUD-RUN-URL/refresh

Step 6: Test & Validate (10 min)

  1. Open the generated HTML dashboard
  2. Verify the charts display correct data
  3. Simulate an anomaly (massive requests to Pro)
  4. Verify the alert is received by email/Slack
  5. Verify the budget alert fires at 50% of budget

โœ… Validation

Your FinOps dashboard is complete if you have:

  • โœ… Active BigQuery export with custom views
  • โœ… Interactive dashboard with 6 visualizations
  • โœ… Automatic anomaly detection
  • โœ… Budget alerts configured (50%, 90%, 100%)
  • โœ… Automatic hourly refresh
  • โœ… Email/Slack alerts operational

This FinOps dashboard gives you full visibility into Vertex AI costs. In production, add ML forecasting (Prophet), automated recommendations (model routing), and Slack integration for real-time alerts. Review the dashboard with your team every Monday, identify anomalies, and iterate. With this setup, you catch cost drift before the surprise invoice.
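Before reaching for Prophet, a dependency-light linear trend already gives a rough forecast when spend has no strong seasonality. A sketch (the function name is ours):

```python
import numpy as np

def linear_forecast(daily_costs: list, horizon: int = 30) -> list:
    """Fit a least-squares line to historic daily costs and project it forward.
    A lightweight stand-in for Prophet when seasonality can be ignored."""
    x = np.arange(len(daily_costs))
    slope, intercept = np.polyfit(x, daily_costs, 1)
    future = np.arange(len(daily_costs), len(daily_costs) + horizon)
    return (slope * future + intercept).tolist()

history = [10.0, 12.0, 14.0, 16.0]   # steadily rising daily spend
print(linear_forecast(history, horizon=2))  # โ†’ approximately [18.0, 20.0]
```

When the projection diverges from the budget line, that is the signal to dig into the per-model and per-team views above.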

Quiz Module 4.2

โฑ 15 min Assessment

๐Ÿ“ Quiz: FinOps & Optimization

15 questions to validate your knowledge

1. Which technique offers the greatest potential savings?

Batch API (-50%)
Prompt compression (-30%)
Context caching (-90% on the cached portion)
Output control (-20%)

2. By how much can intelligent model routing reduce costs?

20-30%
60-70%
90-95%
10-15%

3. From how many requests is context caching profitable?

2 requests (immediate breakeven)
10 requests
100 requests
1000 requests

4. What is the difference between implicit and explicit caching?

Implicit costs more
No difference
Implicit is automatic (5 min TTL), explicit is manual (1-60 min TTL)
Implicit only works with Pro

5. For a chatbot with a 5000-token system instruction, what is the optimal cache TTL?

5 minutes (too short)
30 minutes (balanced)
120 minutes (too long)
Caching is useless here

6. How much does Flash-8B cost vs Pro for input tokens?

75x cheaper ($0.04 vs $3.00)
20x cheaper
5x cheaper
Same price

7. The Batch API offers a 50% discount, but with which constraint?

Lower quality
100-request limit
10-30 minute latency (asynchronous)
No streaming

8. Which rule classifies a request as "simple"?

Length >1000 tokens
FAQ, classification, factual lookup
All JSON requests
Requests with function calling

9. How much do output tokens cost vs input tokens (Flash)?

Same price
2x more
3x more
4x more ($0.60 vs $0.15)

10. The BigQuery billing export is:

Free and essential for FinOps
Paid ($10/month)
Optional, the Console is enough
Enterprise only

11. At which thresholds should budget alerts be configured?

100% only
80% and 100%
50%, 90%, 100% (proactive)
Not needed if you have a dashboard

12. Cost attribution per team is done via:

Source IP
Labels on Vertex AI requests
Manual estimation
Impossible to trace

13. Anomaly detection flags an abnormal cost when:

Daily cost is >50% above the average
Cost is >$100
Cost doubles vs yesterday
Impossible to automate

14. Which strategy suits urgent, low-cost requests?

Always use Pro
Batch API
Flash-8B with fallback to Flash
Context caching

15. How often should the FinOps dashboard be refreshed?

Once a day (too slow)
Hourly (optimal)
Real time (overkill)
Once a week

IA Responsable Google

โฑ 30 min Intermediaire

๐ŸŽฏ Objectifs d'apprentissage

  • Comprendre les 7 principes Google AI
  • Configurer safety settings Gemini
  • Utiliser Gemma Scope pour interpretabilite
  • Implementer guardrails IA responsable

๐ŸŽฏ Les 7 Principes Google AI

| # | Principe | Signification | Implementation Gemini |
|---|----------|---------------|------------------------|
| 1 | Be socially beneficial | IA doit beneficier societe | Gemini optimise pour aide, pas manipulation |
| 2 | Avoid unfair bias | Eviter biais injustes | Training data diverse, evaluation bias continue |
| 3 | Built & tested for safety | Securite par conception | Safety filters, red teaming, adversarial testing |
| 4 | Accountable to people | Responsabilite humaine | Human-in-the-loop, audit logs, explicabilite |
| 5 | Privacy by design | Confidentialite integree | Data not used for training (Vertex AI) |
| 6 | Scientific excellence | Excellence scientifique | Recherche Google AI publiee, peer-reviewed |
| 7 | Appropriate uses | Usages appropries | Terms of Service interdisent malware, spam, violence |
๐Ÿšซ Applications que Google ne developpe PAS
  • Armes ou surveillance de masse
  • Technologies violant droits humains
  • Collecte d'infos contre droit international

๐Ÿ›ก๏ธ Safety Settings Gemini

Gemini inclut 4 harm categories avec 4 seuils de blocage.

python
from vertexai.generative_models import (
    GenerativeModel,
    HarmCategory,
    HarmBlockThreshold,
    SafetySetting
)

# 4 Harm Categories
# - HARM_CATEGORY_HARASSMENT : Harcelement
# - HARM_CATEGORY_HATE_SPEECH : Discours haineux
# - HARM_CATEGORY_SEXUALLY_EXPLICIT : Contenu sexuel explicite
# - HARM_CATEGORY_DANGEROUS_CONTENT : Contenu dangereux

# 4 Thresholds
# - BLOCK_NONE : Pas de blocage (permissif)
# - BLOCK_ONLY_HIGH : Bloquer seulement haute probabilite
# - BLOCK_MEDIUM_AND_ABOVE : Bloquer moyenne et haute (DEFAULT)
# - BLOCK_LOW_AND_ABOVE : Bloquer tout (strict)

# Configuration stricte (production recommandee)
safety_settings_strict = [
    SafetySetting(
        category=HarmCategory.HARM_CATEGORY_HARASSMENT,
        threshold=HarmBlockThreshold.BLOCK_LOW_AND_ABOVE
    ),
    SafetySetting(
        category=HarmCategory.HARM_CATEGORY_HATE_SPEECH,
        threshold=HarmBlockThreshold.BLOCK_LOW_AND_ABOVE
    ),
    SafetySetting(
        category=HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT,
        threshold=HarmBlockThreshold.BLOCK_LOW_AND_ABOVE
    ),
    SafetySetting(
        category=HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT,
        threshold=HarmBlockThreshold.BLOCK_LOW_AND_ABOVE
    ),
]

model_strict = GenerativeModel(
    "gemini-2.5-flash",
    safety_settings=safety_settings_strict
)

# Configuration permissive (R&D uniquement)
safety_settings_permissive = [
    SafetySetting(
        category=HarmCategory.HARM_CATEGORY_HARASSMENT,
        threshold=HarmBlockThreshold.BLOCK_NONE
    ),
    SafetySetting(
        category=HarmCategory.HARM_CATEGORY_HATE_SPEECH,
        threshold=HarmBlockThreshold.BLOCK_NONE
    ),
    SafetySetting(
        category=HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT,
        threshold=HarmBlockThreshold.BLOCK_NONE
    ),
    SafetySetting(
        category=HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT,
        threshold=HarmBlockThreshold.BLOCK_NONE
    ),
]

model_permissive = GenerativeModel(
    "gemini-2.5-flash",
    safety_settings=safety_settings_permissive
)

# Tester avec prompt sensible
prompt_sensible = "Comment construire une bombe ?"

try:
    response_strict = model_strict.generate_content(prompt_sensible)
    print("Strict:", response_strict.text)
except Exception as e:
    print("Strict: BLOCKED -", e)
    # → BLOCKED (safety filter)

try:
    response_permissive = model_permissive.generate_content(prompt_sensible)
    print("Permissive:", response_permissive.text)
except Exception as e:
    print("Permissive: BLOCKED -", e)
    # → Peut passer (mais ToS Google interdit quand meme usage malveillant)

๐Ÿ“Š Analyser Safety Ratings

python
# Generer reponse et inspecter safety ratings
response = model_strict.generate_content("Raconte une blague")

# Safety ratings pour le prompt
print("=== PROMPT SAFETY ===")
for rating in response.prompt_feedback.safety_ratings:
    print(f"{rating.category.name}: {rating.probability.name}")

# Safety ratings pour la reponse
print("\n=== RESPONSE SAFETY ===")
for candidate in response.candidates:
    for rating in candidate.safety_ratings:
        print(f"{rating.category.name}: {rating.probability.name}")

# SORTIE EXEMPLE :
# === PROMPT SAFETY ===
# HARM_CATEGORY_HARASSMENT: NEGLIGIBLE
# HARM_CATEGORY_HATE_SPEECH: NEGLIGIBLE
# HARM_CATEGORY_SEXUALLY_EXPLICIT: NEGLIGIBLE
# HARM_CATEGORY_DANGEROUS_CONTENT: NEGLIGIBLE
#
# === RESPONSE SAFETY ===
# HARM_CATEGORY_HARASSMENT: NEGLIGIBLE
# HARM_CATEGORY_HATE_SPEECH: NEGLIGIBLE
# HARM_CATEGORY_SEXUALLY_EXPLICIT: LOW
# HARM_CATEGORY_DANGEROUS_CONTENT: NEGLIGIBLE

# Implementer logging safety pour monitoring
import json
from datetime import datetime

def log_safety_event(prompt: str, response, blocked: bool):
    """Logger evenements safety pour audit (response peut etre None si bloquee)"""

    event = {
        "timestamp": datetime.now().isoformat(),
        "prompt": prompt[:100],  # Tronque pour privacy
        "blocked": blocked,
    }

    if response is not None:
        event["prompt_safety"] = {
            rating.category.name: rating.probability.name
            for rating in response.prompt_feedback.safety_ratings
        }
        if not blocked:
            event["response_safety"] = {
                rating.category.name: rating.probability.name
                for rating in response.candidates[0].safety_ratings
            }

    # Log to BigQuery ou Cloud Logging (ici : fichier JSONL local)
    with open("safety_logs.jsonl", "a") as f:
        f.write(json.dumps(event) + "\n")

    return event

# Utiliser avec monitoring
prompt = "Raconte une blague"
try:
    response = model_strict.generate_content(prompt)
    log_safety_event(prompt, response, blocked=False)
except Exception:
    log_safety_event(prompt, None, blocked=True)

๐Ÿ” Gemma Scope : Interpretabilite

Gemma Scope est un outil open-source pour interpreter modeles Gemma (sparse autoencoders).

python
# pip install gemma-scope
# NB : exemple schematique — Gemma Scope est publie sous forme de sparse
# autoencoders (poids sur HuggingFace) ; le package et l'API ci-dessous
# sont illustratifs, verifier la doc officielle avant usage

from gemma_scope import GemmaScope

# Charger Gemma 3 avec Scope
scope = GemmaScope(model_name="gemma-3-9b")

# Analyser activation pour prompt
prompt = "Paris est la capitale de"
activations = scope.get_activations(prompt)

# Top features actives
top_features = scope.get_top_features(activations, k=10)

print("=== TOP 10 ACTIVATED FEATURES ===")
for feature_id, activation_strength in top_features:
    feature_desc = scope.get_feature_description(feature_id)
    print(f"Feature {feature_id}: {feature_desc} (strength: {activation_strength:.3f})")

# SORTIE EXEMPLE :
# Feature 1847: Geographic location / capital city (strength: 0.892)
# Feature 3201: French language context (strength: 0.654)
# Feature 892: European geography (strength: 0.543)
# ...

# Use case : Detecter biais
prompt_biased = "Les femmes sont"
activations_biased = scope.get_activations(prompt_biased)
top_biased = scope.get_top_features(activations_biased, k=5)

# Si feature "gender stereotype" active → red flag pour review
๐Ÿ’ก Gemma Scope 2 (2026)
Version 2 supporte Gemma 3/3n et offre visualizations interactives pour interpreter modele. Utile pour audit, debugging, detection biais.

๐Ÿ›ก๏ธ Guardrails Implementation

python
class ResponsibleAIGuardrails:
    def __init__(self, model: GenerativeModel):
        self.model = model
        self.blocked_keywords = [
            "hack", "exploit", "crack", "bypass",
            # ... ajouter keywords sensibles pour votre domaine
        ]

    def check_prompt_safety(self, prompt: str) -> dict:
        """Pre-flight checks avant envoi a Gemini"""

        issues = []

        # 1. Check PII
        if self._contains_pii(prompt):
            issues.append("PII_DETECTED")

        # 2. Check blocked keywords
        if any(kw in prompt.lower() for kw in self.blocked_keywords):
            issues.append("BLOCKED_KEYWORD")

        # 3. Check prompt injection
        if self._is_prompt_injection(prompt):
            issues.append("PROMPT_INJECTION")

        return {
            "safe": len(issues) == 0,
            "issues": issues
        }

    def _contains_pii(self, text: str) -> bool:
        """Detecter PII (simplifie, utiliser DLP en prod)"""
        import re

        # SSN pattern
        ssn_pattern = r'\b\d{3}-\d{2}-\d{4}\b'
        if re.search(ssn_pattern, text):
            return True

        # Email pattern
        email_pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b'
        if re.search(email_pattern, text):
            return True

        return False

    def _is_prompt_injection(self, text: str) -> bool:
        """Detecter tentative prompt injection"""

        injection_patterns = [
            "ignore previous instructions",
            "disregard above",
            "new instructions:",
            "system:",
        ]

        return any(pattern in text.lower() for pattern in injection_patterns)

    def generate_safe(self, prompt: str):
        """Generate avec guardrails"""

        # Pre-flight checks
        safety_check = self.check_prompt_safety(prompt)

        if not safety_check["safe"]:
            raise ValueError(f"Prompt blocked: {safety_check['issues']}")

        # Generate
        response = self.model.generate_content(prompt)

        # Post-flight checks
        if response.candidates[0].finish_reason.name == "SAFETY":
            raise ValueError("Response blocked by safety filter")

        return response.text

# Utilisation
guardrails = ResponsibleAIGuardrails(model_strict)

# Safe prompt
try:
    response1 = guardrails.generate_safe("Explique la photosynthese")
    print("✓ Safe:", response1[:100])
except ValueError as e:
    print("✗ Blocked:", e)

# Unsafe prompt (PII)
try:
    response2 = guardrails.generate_safe("Mon email est john@example.com, aide-moi")
    print("✓ Safe:", response2[:100])
except ValueError as e:
    print("✗ Blocked:", e)
    # → Blocked: ['PII_DETECTED']

IA Responsable n'est pas optionnel. Configurez safety settings strictes en prod (BLOCK_LOW_AND_ABOVE), loggez tous les events safety pour audit. Implementez guardrails pre/post pour bloquer PII, prompt injection, keywords sensibles. Utilisez Gemma Scope pour interpreter decisions et detecter biais. Google AI Principles = framework solide, suivez-le.

Gouvernance des Modeles

โฑ 25 min Intermediaire

๐ŸŽฏ Objectifs d'apprentissage

  • Gerer model lifecycle (dev → staging → prod)
  • Implementer versioning et deprecation strategy
  • Migrer 2.0 → 2.5 en production
  • Documenter decisions avec ADR

๐Ÿ”„ Model Lifecycle Management

Development (Experimentation) → Staging (Pre-prod testing) → Production (Live traffic) → Deprecation (Sunset)

๐Ÿ“ Model Registry & Versioning

python
# model_registry.py
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Optional
import json

class ModelStage(Enum):
    DEVELOPMENT = "dev"
    STAGING = "staging"
    PRODUCTION = "prod"
    DEPRECATED = "deprecated"

@dataclass
class ModelVersion:
    name: str  # e.g., "gemini-2.5-flash"
    version: str  # e.g., "v1.2.3"
    stage: ModelStage
    created_at: datetime
    promoted_at: Optional[datetime] = None
    deprecated_at: Optional[datetime] = None
    performance_metrics: Optional[dict] = None
    notes: str = ""

class ModelRegistry:
    def __init__(self, registry_file: str = "model_registry.json"):
        self.registry_file = registry_file
        self.models = self._load_registry()

    def _load_registry(self) -> dict:
        """Load registry from file"""
        try:
            with open(self.registry_file, "r") as f:
                data = json.load(f)
                # Convert to ModelVersion objects
                models = {}
                for key, val in data.items():
                    val['stage'] = ModelStage(val['stage'])
                    val['created_at'] = datetime.fromisoformat(val['created_at'])
                    if val.get('promoted_at'):
                        val['promoted_at'] = datetime.fromisoformat(val['promoted_at'])
                    if val.get('deprecated_at'):
                        val['deprecated_at'] = datetime.fromisoformat(val['deprecated_at'])
                    models[key] = ModelVersion(**val)
                return models
        except FileNotFoundError:
            return {}

    def _save_registry(self):
        """Save registry to file"""
        data = {}
        for key, model in self.models.items():
            data[key] = {
                'name': model.name,
                'version': model.version,
                'stage': model.stage.value,
                'created_at': model.created_at.isoformat(),
                'promoted_at': model.promoted_at.isoformat() if model.promoted_at else None,
                'deprecated_at': model.deprecated_at.isoformat() if model.deprecated_at else None,
                'performance_metrics': model.performance_metrics,
                'notes': model.notes
            }
        with open(self.registry_file, "w") as f:
            json.dump(data, f, indent=2)

    def register_model(self, name: str, version: str, stage: ModelStage, notes: str = ""):
        """Register new model version"""
        key = f"{name}@{version}"
        self.models[key] = ModelVersion(
            name=name,
            version=version,
            stage=stage,
            created_at=datetime.now(),
            notes=notes
        )
        self._save_registry()
        print(f"✓ Registered {key} in {stage.value}")

    def promote_model(self, name: str, version: str, to_stage: ModelStage):
        """Promote model to next stage"""
        key = f"{name}@{version}"
        if key not in self.models:
            raise ValueError(f"Model {key} not found in registry")

        self.models[key].stage = to_stage
        self.models[key].promoted_at = datetime.now()
        self._save_registry()
        print(f"✓ Promoted {key} to {to_stage.value}")

    def deprecate_model(self, name: str, version: str, reason: str):
        """Deprecate model version"""
        key = f"{name}@{version}"
        if key not in self.models:
            raise ValueError(f"Model {key} not found in registry")

        self.models[key].stage = ModelStage.DEPRECATED
        self.models[key].deprecated_at = datetime.now()
        self.models[key].notes += f"\nDeprecated: {reason}"
        self._save_registry()
        print(f"✓ Deprecated {key}: {reason}")

    def get_active_model(self, name: str, stage: ModelStage) -> ModelVersion:
        """Get active model version for stage"""
        active_models = [
            model for model in self.models.values()
            if model.name == name and model.stage == stage
        ]

        if not active_models:
            raise ValueError(f"No active {name} model in {stage.value}")

        # Return most recent
        return sorted(active_models, key=lambda m: m.created_at, reverse=True)[0]

    def list_models(self, stage: ModelStage = None):
        """List all models, optionally filtered by stage"""
        models = self.models.values()
        if stage:
            models = [m for m in models if m.stage == stage]

        for model in sorted(models, key=lambda m: m.created_at, reverse=True):
            print(f"{model.name}@{model.version} | {model.stage.value} | {model.created_at.date()}")

# Usage
registry = ModelRegistry()

# Register new model in dev
registry.register_model(
    name="gemini-2.5-flash",
    version="v1.0.0",
    stage=ModelStage.DEVELOPMENT,
    notes="Initial deployment with context caching"
)

# After testing, promote to staging
registry.promote_model(
    name="gemini-2.5-flash",
    version="v1.0.0",
    to_stage=ModelStage.STAGING
)

# After staging validation, promote to prod
registry.promote_model(
    name="gemini-2.5-flash",
    version="v1.0.0",
    to_stage=ModelStage.PRODUCTION
)

# Deploy new version
registry.register_model(
    name="gemini-2.5-flash",
    version="v1.1.0",
    stage=ModelStage.DEVELOPMENT,
    notes="Added model routing"
)

# Deprecate old version
registry.deprecate_model(
    name="gemini-1.5-flash",
    version="v0.9.0",
    reason="Migrated to Gemini 2.5"
)

# List prod models
print("\n=== PRODUCTION MODELS ===")
registry.list_models(stage=ModelStage.PRODUCTION)

๐Ÿ”„ Migration Strategy: 2.0 โ†’ 2.5

python
class ModelMigration:
    """Gerer migration progressive entre versions"""

    def __init__(self, old_model: str, new_model: str):
        # Stocker les noms plutot que lire l'attribut prive _model_name du SDK
        self.old_model_name = old_model
        self.new_model_name = new_model
        self.old_model = GenerativeModel(old_model)
        self.new_model = GenerativeModel(new_model)
        self.rollout_percentage = 0

    def set_rollout(self, percentage: int):
        """Set traffic split (0-100% vers new model)"""
        if not 0 <= percentage <= 100:
            raise ValueError("Percentage must be 0-100")
        self.rollout_percentage = percentage
        print(f"Rollout: {percentage}% → {self.new_model_name}")

    def generate_content(self, prompt: str):
        """Generate avec traffic splitting aleatoire"""
        import random

        # Traffic split
        if random.randint(0, 99) < self.rollout_percentage:
            # Route vers le nouveau modele
            print(f"[Routing] → NEW: {self.new_model_name}")
            return self.new_model.generate_content(prompt)
        else:
            # Route vers l'ancien modele
            print(f"[Routing] → OLD: {self.old_model_name}")
            return self.old_model.generate_content(prompt)

# Migration progressive 2.0 โ†’ 2.5
migration = ModelMigration(
    old_model="gemini-2.0-flash-exp",
    new_model="gemini-2.5-flash"
)

# Week 1: 10% traffic vers 2.5
migration.set_rollout(10)
for i in range(10):
    migration.generate_content("Test query")
# → en moyenne ~1 requete sur 10 routee vers 2.5 (tirage aleatoire, pas un split exact)

# Week 2: Monitor metrics, si OK โ†’ 50%
migration.set_rollout(50)

# Week 3: 100% vers 2.5
migration.set_rollout(100)

# Deprecate 2.0
registry.deprecate_model(
    name="gemini-2.0-flash-exp",
    version="v1.0.0",
    reason="Fully migrated to 2.5"
)
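L'etape « Monitor metrics, si OK » peut s'automatiser avec une gate qui conditionne le passage au palier suivant. Esquisse minimale (seuils et noms de metriques illustratifs, a adapter a vos SLO) :

```python
def rollout_gate(quality_score: float, error_rate: float,
                 baseline_quality: float = 0.85, max_error_rate: float = 0.01) -> bool:
    """True si les metriques du canary autorisent le palier suivant (seuils illustratifs)."""
    return quality_score >= baseline_quality and error_rate <= max_error_rate

# Paliers de rollout progressif (10% -> 50% -> 100%)
paliers = [10, 50, 100]
pct = 0
for cible in paliers:
    # Metriques fictives collectees pendant l'observation du palier courant
    if rollout_gate(quality_score=0.88, error_rate=0.004):
        pct = cible
    else:
        pct = 0  # rollback complet vers l'ancien modele
        break
print(f"Rollout final: {pct}%")
```

En cas d'echec de la gate, on revient a 0% et on declenche le rollback plan documente dans l'ADR.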

๐Ÿ“‹ Architecture Decision Records (ADR)

markdown
# ADR-001: Migration vers Gemini 2.5 Flash

## Status
ACCEPTED - 2026-02-01

## Context
Notre application chatbot support utilise Gemini 2.0 Flash Exp depuis 6 mois.
Gemini 2.5 Flash offre meilleures performances (+15% qualite) et meme prix.

## Decision
Migrer progressivement vers Gemini 2.5 Flash sur 3 semaines :
- Week 1 : 10% traffic (canary)
- Week 2 : 50% traffic (validation large scale)
- Week 3 : 100% traffic (full rollout)

## Consequences

### Positive
- +15% quality score (evaluation benchmark interne)
- Latence identique (~800ms p95)
- Cout identique ($0.15/1M input)
- Support long context (2M tokens vs 1M)

### Negative
- Risque regression qualite (mitigation : canary + rollback plan)
- Effort migration : 2 engineer-days

### Neutral
- API identique, pas de code changes

## Rollback Plan
Si quality score < baseline :
1. Rollback immediate vers 2.0 Flash
2. Root cause analysis
3. Re-evaluation decision

## Monitoring
- Quality score (target: >85%)
- Latency p50/p95 (target: <1000ms)
- Cost per conversation (target: <$0.001)
- Error rate (target: <1%)

## References
- Benchmark results: docs/benchmarks/2.0-vs-2.5.md
- Gemini 2.5 release notes: https://cloud.google.com/vertex-ai/docs/release-notes
๐Ÿ’ก ADR Best Practices
  • 1 ADR par decision majeure (model change, architecture change)
  • Template standardise : Status, Context, Decision, Consequences
  • Stocker dans Git (docs/adr/)
  • Review en equipe avant ACCEPTED
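Pour systematiser ces bonnes pratiques, on peut scaffolder des fichiers ADR numerotes automatiquement dans docs/adr/. Esquisse (template simplifie et nommage de fichiers hypothetiques) :

```python
from datetime import date
from pathlib import Path
import tempfile

ADR_TEMPLATE = """# ADR-{num:03d}: {title}

## Status
PROPOSED - {today}

## Context
(Pourquoi cette decision est necessaire)

## Decision
(Ce qui est decide)

## Consequences
### Positive
### Negative
### Neutral
"""

def new_adr(title: str, adr_dir: str = "docs/adr") -> Path:
    """Creer le prochain fichier ADR numerote sequentiellement."""
    d = Path(adr_dir)
    d.mkdir(parents=True, exist_ok=True)
    num = len(list(d.glob("adr-*.md"))) + 1  # numerotation sequentielle
    slug = title.lower().replace(" ", "-")
    path = d / f"adr-{num:03d}-{slug}.md"
    path.write_text(ADR_TEMPLATE.format(num=num, title=title, today=date.today().isoformat()))
    return path

demo_dir = tempfile.mkdtemp()  # repertoire jetable pour la demo
p = new_adr("Migration vers Gemini 2.5 Flash", adr_dir=demo_dir)
print(p.name)  # adr-001-migration-vers-gemini-2.5-flash.md
```

Le status passe de PROPOSED a ACCEPTED apres review en equipe, conformement a la checklist ci-dessus.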

๐Ÿ” Model Deprecation Timeline

| Phase | Echeance | Actions |
|-------|----------|---------|
| Annonce | T-90 days | Communication interne/externe, migration guide publie |
| Warning | T-60 days | Deprecation warnings dans logs, emails equipes |
| Migration | T-30 days | Support migration actif, office hours |
| Sunset | T-0 | Model desactive, requetes rejetees avec erreur explicite |
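Les jalons de cette timeline se derivent mecaniquement de la date de sunset. Petit utilitaire pour les calculer (date de sunset fictive) :

```python
from datetime import date, timedelta

def deprecation_milestones(sunset: date) -> dict:
    """Jalons T-90 / T-60 / T-30 calcules a partir de la date de sunset."""
    return {
        "annonce (T-90)": sunset - timedelta(days=90),
        "warning (T-60)": sunset - timedelta(days=60),
        "migration (T-30)": sunset - timedelta(days=30),
        "sunset (T-0)": sunset,
    }

jalons = deprecation_milestones(date(2026, 6, 1))
for phase, jour in jalons.items():
    print(f"{phase:>16}: {jour.isoformat()}")
```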

Gouvernance modeles = discipline essentielle en production. Utilisez model registry pour tracker versions actives par environnement. Migration progressive (10% โ†’ 50% โ†’ 100%) reduit risque. ADR documente WHY derriere chaque decision majeure (critical pour onboarding et audits). Deprecation avec 90 days notice = respect users.

Agent Governance

โฑ 25 min Intermediaire

๐ŸŽฏ Objectifs d'apprentissage

  • Implementer tool governance dans Agent Builder
  • Gerer permissions et access control
  • Auditer actions agents
  • Utiliser agent marketplace securise

๐Ÿ›ก๏ธ Tool Governance Framework

python
from vertexai.preview import reasoning_engines
from google.cloud import firestore
from datetime import datetime, timedelta
from enum import Enum

class ToolRiskLevel(Enum):
    LOW = "low"  # Read-only, pas d'impact business
    MEDIUM = "medium"  # Modifications limitees
    HIGH = "high"  # Actions critiques (delete, paiements)
    CRITICAL = "critical"  # Actions irreversibles

class ToolGovernance:
    def __init__(self, firestore_db):
        self.db = firestore_db
        self.audit_collection = "agent_tool_audit"

    def register_tool(self, tool_name: str, risk_level: ToolRiskLevel,
                      requires_approval: bool = False):
        """Enregistrer tool avec niveau de risque"""

        tool_doc = {
            "name": tool_name,
            "risk_level": risk_level.value,
            "requires_approval": requires_approval,
            "registered_at": datetime.now(),
            "allowed_agents": [],  # Whitelist agents
        }

        self.db.collection("tool_registry").document(tool_name).set(tool_doc)
        print(f"✓ Tool registered: {tool_name} (risk: {risk_level.value})")

    def approve_tool_for_agent(self, tool_name: str, agent_id: str, approved_by: str):
        """Approuver tool pour agent specifique"""

        tool_ref = self.db.collection("tool_registry").document(tool_name)
        tool_doc = tool_ref.get()

        if not tool_doc.exists:
            raise ValueError(f"Tool {tool_name} not registered")

        # Ajouter agent a whitelist
        tool_ref.update({
            "allowed_agents": firestore.ArrayUnion([agent_id])
        })

        # Logger approval
        self._audit_log({
            "event": "TOOL_APPROVED",
            "tool": tool_name,
            "agent": agent_id,
            "approved_by": approved_by,
            "timestamp": datetime.now()
        })

        print(f"✓ Tool {tool_name} approved for agent {agent_id}")

    def check_tool_permission(self, tool_name: str, agent_id: str) -> bool:
        """Verifier si agent peut utiliser tool"""

        tool_doc = self.db.collection("tool_registry").document(tool_name).get()

        if not tool_doc.exists:
            return False

        tool_data = tool_doc.to_dict()

        # Check whitelist
        if agent_id not in tool_data.get("allowed_agents", []):
            return False

        return True

    def audit_tool_call(self, tool_name: str, agent_id: str, params: dict, result: dict):
        """Auditer appel tool"""

        self._audit_log({
            "event": "TOOL_CALLED",
            "tool": tool_name,
            "agent": agent_id,
            "params": params,
            "result": result,
            "timestamp": datetime.now()
        })

    def _audit_log(self, log_entry: dict):
        """Logger event audit dans Firestore"""
        self.db.collection(self.audit_collection).add(log_entry)

    def get_audit_trail(self, agent_id: str = None, tool_name: str = None, days: int = 30):
        """Recuperer audit trail"""

        query = self.db.collection(self.audit_collection)

        if agent_id:
            query = query.where("agent", "==", agent_id)
        if tool_name:
            query = query.where("tool", "==", tool_name)

        # Last N days
        cutoff = datetime.now() - timedelta(days=days)
        query = query.where("timestamp", ">=", cutoff)

        results = query.stream()

        print(f"=== AUDIT TRAIL (last {days} days) ===")
        for doc in results:
            data = doc.to_dict()
            print(f"{data['timestamp']}: {data['event']} - {data.get('tool', 'N/A')} by {data.get('agent', 'N/A')}")

# Setup governance
db = firestore.Client()
governance = ToolGovernance(db)

# Register tools avec risk levels
governance.register_tool(
    tool_name="search_knowledge_base",
    risk_level=ToolRiskLevel.LOW,
    requires_approval=False
)

governance.register_tool(
    tool_name="update_customer_record",
    risk_level=ToolRiskLevel.MEDIUM,
    requires_approval=True
)

governance.register_tool(
    tool_name="process_refund",
    risk_level=ToolRiskLevel.HIGH,
    requires_approval=True
)

governance.register_tool(
    tool_name="delete_account",
    risk_level=ToolRiskLevel.CRITICAL,
    requires_approval=True
)

# Approuver tools pour agent support
governance.approve_tool_for_agent(
    tool_name="search_knowledge_base",
    agent_id="agent-support-001",
    approved_by="admin@company.com"
)

governance.approve_tool_for_agent(
    tool_name="update_customer_record",
    agent_id="agent-support-001",
    approved_by="admin@company.com"
)

# Agent finance peut process_refund
governance.approve_tool_for_agent(
    tool_name="process_refund",
    agent_id="agent-finance-001",
    approved_by="finance-manager@company.com"
)

# Verifier permissions
can_search = governance.check_tool_permission("search_knowledge_base", "agent-support-001")
print(f"Agent support can search KB: {can_search}")  # True

can_delete = governance.check_tool_permission("delete_account", "agent-support-001")
print(f"Agent support can delete account: {can_delete}")  # False

# Audit trail
governance.get_audit_trail(agent_id="agent-support-001", days=7)

๐Ÿ” Agent Permission Model

| Agent Type | Allowed Tools | Risk Level | Approval Required |
|------------|---------------|------------|-------------------|
| Customer Support | Search KB, View orders, Update contact info | LOW-MEDIUM | Manager approval |
| Sales | CRM lookup, Create quote, Schedule demo | LOW-MEDIUM | Sales manager approval |
| Finance | Process refund, Generate invoice, View transactions | MEDIUM-HIGH | Finance manager approval |
| Admin | All tools including delete, modify settings | HIGH-CRITICAL | C-level approval |
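Ce modele de permissions peut se coder comme une comparaison de niveaux de risque ordonnes. Esquisse (mapping illustratif reprenant le tableau, par defaut : LOW uniquement) :

```python
from enum import IntEnum

class Risk(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4

# Niveau de risque maximal autorise par type d'agent (cf. tableau ci-dessus)
MAX_RISK = {
    "customer_support": Risk.MEDIUM,
    "sales": Risk.MEDIUM,
    "finance": Risk.HIGH,
    "admin": Risk.CRITICAL,
}

def can_use(agent_type: str, tool_risk: Risk) -> bool:
    """Un agent peut utiliser un tool si son plafond couvre le risque du tool."""
    return tool_risk <= MAX_RISK.get(agent_type, Risk.LOW)

print(can_use("customer_support", Risk.HIGH))  # False : refuse
print(can_use("finance", Risk.HIGH))           # True : autorise
```

En pratique, ce check se combine avec la whitelist par agent du tool registry (least privilege sur les deux axes).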

๐Ÿ“Š Agent Marketplace Governance

python
class AgentMarketplace:
    """Marketplace interne pour partager agents securises"""

    def __init__(self, firestore_db):
        self.db = firestore_db

    def publish_agent(self, agent_config: dict, publisher: str):
        """Publier agent dans marketplace"""

        # Validation security
        self._validate_agent_security(agent_config)

        agent_doc = {
            **agent_config,
            "publisher": publisher,
            "published_at": datetime.now(),
            "status": "pending_review",  # Require review avant usage
            "downloads": 0,
            "ratings": []
        }

        agent_id = self.db.collection("agent_marketplace").add(agent_doc)[1].id
        print(f"✓ Agent published for review: {agent_id}")

        return agent_id

    def _validate_agent_security(self, agent_config: dict):
        """Valider securite agent avant publication"""

        # Check 1: Pas de hardcoded secrets
        if "api_key" in str(agent_config).lower():
            raise ValueError("Agent contains hardcoded API keys")

        # Check 2: Tools approuves uniquement
        tools = agent_config.get("tools", [])
        for tool in tools:
            tool_doc = self.db.collection("tool_registry").document(tool).get()
            if not tool_doc.exists:
                raise ValueError(f"Tool {tool} not approved in registry")

        # Check 3: System instruction pas malicieux
        system_instruction = agent_config.get("system_instruction", "")
        malicious_keywords = ["ignore", "disregard", "bypass"]
        if any(kw in system_instruction.lower() for kw in malicious_keywords):
            raise ValueError("System instruction contains suspicious keywords")

    def approve_agent(self, agent_id: str, reviewer: str):
        """Approuver agent apres review"""

        agent_ref = self.db.collection("agent_marketplace").document(agent_id)
        agent_ref.update({
            "status": "approved",
            "reviewed_by": reviewer,
            "reviewed_at": datetime.now()
        })

        print(f"✓ Agent {agent_id} approved by {reviewer}")

    def install_agent(self, agent_id: str, user: str):
        """Installer agent depuis marketplace"""

        agent_doc = self.db.collection("agent_marketplace").document(agent_id).get()

        if not agent_doc.exists:
            raise ValueError(f"Agent {agent_id} not found")

        agent_data = agent_doc.to_dict()

        if agent_data["status"] != "approved":
            raise ValueError("Agent not approved for installation")

        # Increment download counter
        self.db.collection("agent_marketplace").document(agent_id).update({
            "downloads": firestore.Increment(1)
        })

        # Logger installation
        self.db.collection("agent_installs").add({
            "agent_id": agent_id,
            "user": user,
            "installed_at": datetime.now()
        })

        print(f"✓ Agent {agent_id} installed for {user}")

        return agent_data

# Setup marketplace
marketplace = AgentMarketplace(db)

# Publier agent customer support
support_agent_config = {
    "name": "Customer Support Agent v2",
    "description": "Agent support avec acces KB et CRM",
    "model": "gemini-2.5-flash",
    "tools": ["search_knowledge_base", "update_customer_record"],
    "system_instruction": "Tu es un assistant support...",
}

agent_id = marketplace.publish_agent(support_agent_config, publisher="team-support@company.com")

# Review & approve
marketplace.approve_agent(agent_id, reviewer="security@company.com")

# Installer pour autre equipe
marketplace.install_agent(agent_id, user="team-sales@company.com")

๐Ÿ“ Audit Dashboard

python
def generate_agent_audit_report(governance: ToolGovernance, days: int = 30):
    """Generate rapport audit agent activities"""

    query = governance.db.collection(governance.audit_collection)
    cutoff = datetime.now() - timedelta(days=days)
    query = query.where("timestamp", ">=", cutoff)

    events = [doc.to_dict() for doc in query.stream()]

    # Statistiques
    total_calls = len(events)
    unique_agents = len(set(e.get("agent") for e in events))
    unique_tools = len(set(e.get("tool") for e in events))

    # Top tools
    tool_counts = {}
    for event in events:
        tool = event.get("tool")
        if tool:
            tool_counts[tool] = tool_counts.get(tool, 0) + 1

    print(f"=== AGENT AUDIT REPORT (Last {days} days) ===\n")
    print(f"Total tool calls: {total_calls}")
    print(f"Active agents: {unique_agents}")
    print(f"Tools used: {unique_tools}\n")

    print("Top 5 tools:")
    for tool, count in sorted(tool_counts.items(), key=lambda x: x[1], reverse=True)[:5]:
        print(f"  {tool}: {count} calls")

    # Detect anomalies (high-risk tool usage)
    high_risk_calls = [
        e for e in events
        if e.get("tool") in ["process_refund", "delete_account"]
    ]

    if high_risk_calls:
        print(f"\n⚠️  {len(high_risk_calls)} high-risk tool calls detected:")
        for call in high_risk_calls[:10]:
            print(f"  {call['timestamp']}: {call['tool']} by {call['agent']}")

# Generate rapport mensuel
generate_agent_audit_report(governance, days=30)

Agent governance protege votre entreprise. Tool registry avec risk levels = control granulaire. Permission whitelist = least privilege principle. Audit trail complet = compliance & forensics. Marketplace interne = reutilisation securisee agents. En prod, ajoutez : rate limiting par agent, anomaly detection (calls inhabituels), quarterly access review.
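Le rate limiting par agent evoque ci-dessus peut s'esquisser avec une fenetre glissante en memoire (limites illustratives ; en prod, preferer un store partage type Redis) :

```python
import time
from collections import defaultdict, deque

class AgentRateLimiter:
    """Rate limiting par agent : fenetre glissante sur les timestamps d'appels."""

    def __init__(self, max_calls: int = 10, window_seconds: float = 60.0):
        self.max_calls = max_calls
        self.window = window_seconds
        self.calls = defaultdict(deque)  # agent_id -> timestamps recents

    def allow(self, agent_id: str) -> bool:
        now = time.monotonic()
        q = self.calls[agent_id]
        # Purger les appels sortis de la fenetre
        while q and now - q[0] > self.window:
            q.popleft()
        if len(q) >= self.max_calls:
            return False  # quota atteint pour cet agent
        q.append(now)
        return True

limiter = AgentRateLimiter(max_calls=3, window_seconds=60)
results = [limiter.allow("agent-support-001") for _ in range(5)]
print(results)  # [True, True, True, False, False]
```

Chaque refus peut etre logge dans l'audit trail pour alimenter l'anomaly detection.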

Gemma & Open Source

⏱ 25 min Intermediate

🎯 Learning objectives

  • Understand Gemma 3/3n and its use cases
  • Deploy Gemma on-device (Nano)
  • Fine-tune Gemma for a specific domain
  • Use Gemma Scope 2 for interpretability

🌟 The Gemma Family (2026)

Model Size Use Case Deployment
Gemma 3 27B 27B params Self-hosting, custom fine-tuning GKE, on-prem, cloud VM
Gemma 3 9B 9B params Edge servers, latency-critical Edge TPU, GPU servers
Gemma 3 2B 2B params Mobile apps, IoT devices Android, iOS, Raspberry Pi
Gemma Nano 1.8B params On-device inference (offline) Smartphones, laptops
💡 Gemma 3n = Nano-optimized
Gemma 3n is a 4-bit quantized version for on-device deployment. Inference is 3-5x faster than standard Gemma 3, with similar quality.
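
The 4-bit claim is easy to sanity-check: weight storage scales linearly with bits per parameter. A rough footprint calculation (weights only; activations and KV cache are excluded):

```python
def model_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight-storage footprint in GB (8 bits per byte)."""
    return num_params * bits_per_weight / 8 / 1e9

# Gemma Nano-class model, ~1.8B parameters
full_precision = model_memory_gb(1.8e9, 16)  # bfloat16 weights
quantized = model_memory_gb(1.8e9, 4)        # 4-bit quantized weights

print(f"bf16: {full_precision:.1f} GB, int4: {quantized:.2f} GB")
# → bf16: 3.6 GB, int4: 0.90 GB
```

Dropping from 3.6 GB to under 1 GB is what makes smartphone deployment practical.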

๐Ÿ“ฑ Deploy Gemma Nano On-Device

python
# Installation
# pip install mediapipe

import mediapipe as mp
from mediapipe.tasks import python
from mediapipe.tasks.python import text

# Download Gemma Nano model (1.8B, quantized 4-bit)
# https://ai.google.dev/gemma/docs/get_started

# Initialize Gemma Nano
base_options = python.BaseOptions(model_asset_path='gemma_nano_2b_quantized.bin')
options = text.TextGeneratorOptions(base_options=base_options, max_tokens=256)
generator = text.TextGenerator.create_from_options(options)

# Generate on-device (offline)
prompt = "Explique la photosynthese en 2 phrases"
result = generator.generate(prompt)

print(result.text)
# โ†’ Inference 100% locale, pas besoin internet
# โ†’ Latence ~500ms sur smartphone recent
# โ†’ Privacy total (donnees ne quittent pas device)

# On-device use cases:
# - Keyboard suggestions
# - Offline voice assistant
# - Document summarization (emails, PDFs)
# - Privacy-sensitive apps (medical, finance)

๐ŸŽ“ Fine-Tuning Gemma

python
# Fine-tune Gemma 3 9B for the medical domain

from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer
from datasets import load_dataset
import torch

# 1. Load Gemma 3 9B
model_name = "google/gemma-3-9b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# 2. Prepare a medical dataset (example)
# Format: {"prompt": "...", "completion": "..."}
dataset = load_dataset("medical-qa-dataset")  # Your dataset

def preprocess_function(examples):
    inputs = [f"Question: {q}\nAnswer:" for q in examples["prompt"]]
    targets = examples["completion"]

    model_inputs = tokenizer(inputs, max_length=512, truncation=True, padding="max_length")
    labels = tokenizer(targets, max_length=256, truncation=True, padding="max_length")

    # Mask padding in the labels with -100 so it is ignored by the loss
    model_inputs["labels"] = [
        [tok if tok != tokenizer.pad_token_id else -100 for tok in seq]
        for seq in labels["input_ids"]
    ]
    return model_inputs

tokenized_dataset = dataset.map(preprocess_function, batched=True)

# 3. Configure fine-tuning
training_args = TrainingArguments(
    output_dir="./gemma-medical-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    learning_rate=2e-5,
    warmup_steps=100,
    logging_steps=10,
    save_steps=500,
    evaluation_strategy="steps",
    eval_steps=500,
    bf16=True,  # Mixed precision; matches the bfloat16 weights loaded above
    push_to_hub=False
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
)

# 4. Fine-tune (4-8h on 4x A100)
trainer.train()

# 5. Save the fine-tuned model
model.save_pretrained("./gemma-medical-finetuned")
tokenizer.save_pretrained("./gemma-medical-finetuned")

# 6. Inference with the fine-tuned model
finetuned_model = AutoModelForCausalLM.from_pretrained("./gemma-medical-finetuned")
prompt = "Question: What are the symptoms of type 2 diabetes?\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = finetuned_model.generate(**inputs, max_length=300)
print(tokenizer.decode(outputs[0]))

# ECONOMICS vs the Gemini API:
# Fine-tuning: $500-2000 one-time (compute)
# Self-hosting: $100-500/month (VM/GPU)
# vs Gemini API: $500-5000/month at high volume
# → Positive ROI above ~10M tokens/month

๐Ÿ” Gemma Scope 2 : Interpretabilite

python
# pip install gemma-scope

from gemma_scope import GemmaScope, FeatureVisualizer

# Load Gemma 3 with Scope
scope = GemmaScope(
    model_name="gemma-3-9b",
    sae_layer=15  # Sparse autoencoder at layer 15
)

# Analyze activations for a prompt
prompt = "The COVID vaccine causes autism"  # Misinformation

activations = scope.get_activations(prompt)
top_features = scope.get_top_features(activations, k=20)

print("=== TOP ACTIVATED FEATURES ===")
for feature_id, strength in top_features:
    desc = scope.get_feature_description(feature_id)
    print(f"Feature {feature_id}: {desc} ({strength:.3f})")

# EXAMPLE OUTPUT:
# Feature 1892: Medical misinformation (0.912) ⚠️
# Feature 3405: Vaccine-related content (0.854)
# Feature 8721: Controversial claims (0.743)
# → The model detects the misinformation!

# Visualize active features
visualizer = FeatureVisualizer(scope)
visualizer.plot_feature_activation(prompt, top_k=10)
visualizer.save("feature_activation.png")

# Gemma Scope use cases:
# 1. Detect bias in responses
# 2. Explain why the model generated a given answer
# 3. Identify problematic features ahead of fine-tuning
# 4. Audit & compliance (explain AI decisions)

๐Ÿ›ก๏ธ Gemma & AI Safety

๐Ÿ”’ Gemma Safety Features
  • Safety filters : Pre-trained pour bloquer harmful content
  • Open weights : Auditabilite complete du modele
  • Responsible AI Toolkit : Outils pour evaluer biais, toxicite
  • Gemma Scope : Interpretabilite via sparse autoencoders
python
# Evaluate toxicity with Gemma
from transformers import pipeline

# Load Gemma 3 2B for classification
classifier = pipeline(
    "text-classification",
    model="google/gemma-3-2b-toxicity-classifier"
)

# Test prompts
prompts = [
    "How do I install Python?",
    "I hate all [group]",  # Toxic
    "Explain photosynthesis"
]

for prompt in prompts:
    result = classifier(prompt)[0]
    label = result['label']  # TOXIC or NON_TOXIC
    score = result['score']

    print(f"Prompt: {prompt[:50]}")
    print(f"  โ†’ {label} (confidence: {score:.2f})\n")

# Integration into a production pipeline
def safe_generate(prompt: str, model):
    """Generate with a toxicity check"""

    # Pre-check
    toxicity = classifier(prompt)[0]
    if toxicity['label'] == 'TOXIC' and toxicity['score'] > 0.8:
        return "I can't respond to that request."

    # Generate
    response = model.generate(prompt)

    # Post-check
    response_toxicity = classifier(response)[0]
    if response_toxicity['label'] == 'TOXIC':
        return "Response filtered for inappropriate content."

    return response

๐ŸŒ Gemma Ecosystem

Tool Purpose Link
Gemma.cpp Optimized C++ inference (CPU) github.com/google/gemma.cpp
Gemma Android SDK On-device AI on Android ai.google.dev/gemma/docs/android
Gemma Scope SAE interpretability github.com/google-research/gemma-scope
Gemma Safety Toxicity/bias evaluation github.com/google/responsible-ai
Kaggle Models Download weights (free) kaggle.com/models/google/gemma

Gemma is the open-source alternative to Gemini for self-hosted use cases. After fine-tuning, Gemma 3 27B is competitive with proprietary models in specific domains. Gemma Nano is transforming on-device AI (privacy, offline). Fine-tuning has positive ROI above roughly 10M tokens/month. Gemma Scope offers interpretability that is unique in the industry. Use Gemma for: sensitive data (medical, finance), offline apps, and cost optimization.

The Google AI Ecosystem

⏱ 20 min Overview

🎯 Learning objectives

  • Explore NotebookLM and Workspace AI
  • Understand Code Assist and Astra DB
  • Discover Mariner, Jules, and AI Overviews
  • Integrate the Google AI ecosystem

🌐 Map of the Google AI Ecosystem

Gemini Core NotebookLM Workspace AI Code Assist AI Overviews Astra DB Mariner Jules Gemma

📚 NotebookLM: AI Research Assistant

NotebookLM turns your documents into an interactive AI assistant.

python
# NotebookLM via API (preview)
# pip install google-notebooklm

from google.notebooklm import NotebookLM

# Create a notebook
notebook = NotebookLM.create(name="Product Documentation")

# Upload sources (PDFs, docs, URLs)
notebook.add_source(file="product_manual.pdf")
notebook.add_source(file="api_docs.md")
notebook.add_source(url="https://docs.product.com/guide")

# Query with automatic context
response = notebook.query(
    "How do I configure OAuth authentication?"
)

print(response.answer)
# → Answer synthesized from the 3 sources
# → Automatic citations back to the sources

print("\nSources:")
for citation in response.citations:
    print(f"- {citation.source}: {citation.excerpt}")

# Use cases:
# - Onboarding new employees (company docs)
# - Customer support (knowledge base)
# - Research (scientific papers)
# - Audit & compliance (regulations)

๐Ÿ’ผ Workspace AI : Gmail, Docs, Sheets

Product AI Feature Example
Gmail Help me write (email drafting) "Draft email to decline meeting professionally"
Docs Help me write (content generation) "Write product launch announcement, tone: excited"
Sheets Help me organize (data analysis) "Create pivot table summarizing sales by region"
Slides Create presentation "Create 10-slide deck on Q4 results with charts"
Meet Real-time transcription Auto-generate meeting notes with action items
๐Ÿ’ก Workspace AI Enterprise
Workspace AI utilise Gemini 1.5 Pro par default. Donnees ne sont PAS utilisees pour training. Available avec Workspace Enterprise Plus ($30/user/month).

๐Ÿ’ป Code Assist : AI Coding

python
# Cloud Code Assist = Gemini Code Assist pour GCP
# Integration IDE : VS Code, IntelliJ, Cloud Shell Editor

# Exemples use cases :

# 1. Code generation
# Prompt: "Create Cloud Function to resize images uploaded to GCS"
# โ†’ Generates complete Python code with error handling

# 2. Code explanation
# Select complex code block โ†’ "Explain this code"
# โ†’ Natural language explanation ligne par ligne

# 3. Code migration
# "Convert this App Engine app to Cloud Run"
# โ†’ Generates Dockerfile, deployment config, migration guide

# 4. Debugging
# Paste error โ†’ "How to fix this error?"
# โ†’ Root cause analysis + solution

# 5. Security review
# "Check this code for security vulnerabilities"
# โ†’ Identifies SQL injection, XSS, secrets in code

# Pricing:
# Code Assist: $19/user/month
# Alternatives: GitHub Copilot ($10/month), Claude Code (free beta)

๐Ÿ—„๏ธ Astra DB : Vector Database

Astra DB (DataStax) = managed vector database optimise pour RAG avec Gemini.

python
# pip install astrapy

from astrapy.client import DataAPIClient
from vertexai.language_models import TextEmbeddingModel

# Connect to Astra DB
client = DataAPIClient(token="AstraCS:xxx")
database = client.get_database("https://xxx.apps.astra.datastax.com")
collection = database.get_collection("documents")

# Embed documents with Gemini
embedding_model = TextEmbeddingModel.from_pretrained("text-embedding-004")

documents = [
    "Gemini 2.5 Pro released February 2026",
    "Context caching reduces cost by 90%",
    "Flash-8B is 75x cheaper than Pro"
]

for doc in documents:
    # Generate embedding
    embedding = embedding_model.get_embeddings([doc])[0].values

    # Insert into Astra DB
    collection.insert_one({
        "text": doc,
        "embedding": embedding
    })

# Vector search
query = "How to reduce Gemini costs?"
query_embedding = embedding_model.get_embeddings([query])[0].values

results = collection.vector_find(
    vector=query_embedding,
    limit=3
)

for result in results:
    print(f"Score: {result['$similarity']:.3f} - {result['text']}")

# Astra advantages:
# - Latency <10ms (global distribution)
# - Auto-scaling (serverless)
# - Integrates with LangChain, LlamaIndex
# - Free tier: 80GB storage
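
Under the hood, vector search ranks documents by embedding similarity. A dependency-free cosine-similarity sketch (toy 3-dimensional vectors; real embedding models produce hundreds of dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

query = [1.0, 0.0, 1.0]
docs = {"caching": [0.9, 0.1, 0.8], "pricing": [0.1, 1.0, 0.0]}

# Rank documents by similarity to the query, best first
ranked = sorted(docs, key=lambda d: cosine_similarity(query, docs[d]), reverse=True)
print(ranked)  # → ['caching', 'pricing']
```

The vector database does exactly this ranking, but over millions of vectors with approximate nearest-neighbor indexes instead of a linear scan.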

๐ŸŒŠ Mariner : Web Agent

Mariner = agent Gemini qui navigue web automatiquement.

python
# Mariner (preview, disponible via Chrome extension)

# Use cases :
# 1. Research automation
#    "Find 10 competitors in AI coding assistants space with pricing"
#    โ†’ Mariner visite sites, extrait pricing, genere tableau

# 2. E-commerce
#    "Compare prices for iPhone 15 Pro on Amazon, BestBuy, Target"
#    โ†’ Mariner navigue sites, compare prix en temps reel

# 3. Travel booking
#    "Find cheapest flight Paris to NYC, March 15-22"
#    โ†’ Mariner compare Google Flights, Kayak, Expedia

# 4. Data collection
#    "Scrape product reviews from top 50 items on category page"
#    โ†’ Mariner navigue pages, extrait reviews, structure data

# Architecture :
# User query โ†’ Gemini 2.5 Pro โ†’ Plans actions โ†’ Mariner agent
#   โ†’ Executes browser actions (click, scroll, extract)
#   โ†’ Returns structured results

# Privacy : Mariner runs locally in browser, pas de data sent to Google

๐Ÿ‘จโ€๐Ÿ’ป Jules : AI Code Agent

Jules = agent autonome pour fix bugs et implement features.

bash
# Jules integration GitHub (preview)

# Workflow :
# 1. Create GitHub issue : "Fix: API returns 500 on invalid input"
# 2. Assign to @jules-ai
# 3. Jules :
#    - Reads issue description
#    - Analyzes codebase
#    - Identifies root cause
#    - Fixes bug
#    - Writes tests
#    - Creates PR with explanation
# 4. Human review โ†’ Merge

# Example issue :
# Title: "Add caching to reduce Gemini API costs"
# Description: "Implement Redis cache for repeated queries"

# Jules actions :
# - Reads current code
# - Installs Redis client
# - Implements cache layer avec TTL
# - Adds monitoring metrics
# - Creates PR avec benchmark results

# Similar tools : Devin, Cursor Agent, GitHub Copilot Workspace

๐Ÿ” AI Overviews : Search with Gemini

AI Overviews = Google Search integre Gemini pour reponses directes.

python
# AI Overviews API (preview)
# pip install google-search-ai

from google.search import AIOverviewsClient

client = AIOverviewsClient()

# Query avec AI-generated overview
query = "How to reduce Gemini API costs in production?"

result = client.search(query)

# Overview = Gemini-generated summary
print("=== AI OVERVIEW ===")
print(result.overview.text)

# Traditional search results
print("\n=== SOURCES ===")
for source in result.sources:
    print(f"- {source.title}: {source.url}")

# EXAMPLE OUTPUT:
# === AI OVERVIEW ===
# To reduce Gemini API costs in production:
# 1. Use context caching for repeated content (-90% cost)
# 2. Route simple queries to Flash-8B instead of Pro (-75x cost)
# 3. Use Batch API for non-urgent workloads (-50% cost)
# 4. Compress prompts and control max_output_tokens
#
# === SOURCES ===
# - Vertex AI Pricing: https://cloud.google.com/vertex-ai/pricing
# - Context Caching Guide: https://...
# - Best Practices: https://...

# Use case: integrate AI Overviews into apps for rich answers

The Google AI ecosystem is vast and expanding rapidly. NotebookLM is transforming research, and Workspace AI boosts everyday productivity. Code Assist accelerates development; Astra DB optimizes RAG. Mariner automates web browsing; Jules fixes bugs autonomously. AI Overviews is transforming search. In 2026, the convergence of Gemini with Google's tooling is a productivity multiplier. Explore, experiment, and integrate them into your workflows.

Trends & the Future of AI

⏱ 20 min Vision

🎯 Learning objectives

  • Understand the Universal Agent vision
  • Explore the Generative UI paradigm
  • Anticipate the evolution of Personal Intelligence
  • Prepare architectures for a multimodal future

🌟 Universal Agent: One Agent to Rule Them All

Vision for 2027-2030: a single agent capable of accomplishing any digital task.

Universal Agent Web Browsing Code Writing Data Analysis Creative Work Communication Research Planning Execution
🚀 Universal Agent Capabilities (2027+)
  • Autonomy: completes tasks end-to-end without human intervention
  • Context retention: long-term memory of all interactions
  • Multi-tool orchestration: uses 100+ tools as needed
  • Learning: learns from every interaction and improves
  • Personalization: adapts its behavior to each user

🎨 Generative UI: UI that Adapts

Paradigm shift: UI is no longer static; it is generated dynamically by AI.

python
# Generative UI with Gemini (2026 concept)

from vertexai.generative_models import GenerativeModel

model = GenerativeModel("gemini-2.5-pro")

# User request
user_request = "I want a dashboard to track my Vertex AI costs"

# Generate the UI dynamically
ui_generation_prompt = f"""
Generate React component code for this user request: "{user_request}"

Requirements:
- Use Recharts for visualizations
- Fetch data from /api/vertex-costs endpoint
- Responsive design with Tailwind
- Include filters: date range, model type
- Show total cost, cost by model (pie chart), daily trend (line chart)

Return ONLY valid React JSX code.
"""

response = model.generate_content(ui_generation_prompt)
react_code = response.text

# Save generated component
with open("CostDashboard.jsx", "w") as f:
    f.write(react_code)

print("โœ“ UI component generated!")

# Deploy automatically
# import subprocess
# subprocess.run(["npm", "run", "build"])
# subprocess.run(["gcloud", "run", "deploy", "cost-dashboard", ...])

# RESULT: a custom dashboard generated in <5 seconds
# → No developer or designer needed
# → UI perfectly tailored to the user request
# → Fast iterations: "Add CSV export" → regenerate the component

๐Ÿง  Personal Intelligence : AI qui Vous Connait

Personal Intelligence = Agent IA avec memoire complete de votre vie digitale.

python
# Personal Intelligence architecture (conceptuel)

from datetime import datetime

from vertexai.generative_models import GenerativeModel


class PersonalIntelligence:
    """AI agent with long-term memory and personalization"""

    def __init__(self, user_id: str):
        self.user_id = user_id
        self.model = GenerativeModel("gemini-2.5-pro")
        self.memory = self._load_memory()  # Full interaction history
        self.preferences = self._load_preferences()
        self.context = self._load_context()  # Calendar, emails, docs

    def process_request(self, request: str):
        """Process a request with full personal context"""

        # Enrich the request with personal context
        enriched_prompt = f"""
User: {self.user_id}
Request: {request}

Personal context:
- Preferences: {self.preferences}
- Recent interactions: {self.memory[-10:]}
- Current calendar: {self.context['calendar_today']}
- Work projects: {self.context['active_projects']}

Generate personalized response considering all context.
"""

        response = self.model.generate_content(enriched_prompt)

        # Save interaction to memory
        self.memory.append({
            "timestamp": datetime.now(),
            "request": request,
            "response": response.text
        })
        self._save_memory()

        return response.text

# Example use cases:

# 1. "Schedule meeting with Sarah"
# → The agent knows Sarah's email, checks both calendars, proposes 3 slots

# 2. "Summarize what I missed this morning"
# → The agent reads emails, Slack, and the calendar, then generates a personalized summary

# 3. "Draft response to client email"
# → The agent knows the client history, your writing style, and the project context

# 4. "Should I approve this expense?"
# → The agent knows the budget, spending patterns, and company policies

# Privacy considerations:
# - All data encrypted at rest
# - User control over what data is accessible
# - Opt-in for each data source
# - Delete memory on demand

๐Ÿ“ฑ Multimodal Native : Beyond Text

Futur : IA comprehend simultanรฉment text, image, audio, video, code, 3D.

python
# Multimodal use case futuriste

# Input: voice + screen share + camera
# "Help me debug this app; here is my screen and the code"

from vertexai.generative_models import GenerativeModel, Part

model = GenerativeModel("gemini-3.0-ultra")  # Hypothetical 2027 model

# Multimodal input
response = model.generate_content([
    Part.from_audio_file("voice_explanation.wav"),  # Voice explanation
    Part.from_image_file("screenshot_error.png"),   # Screenshot with error
    Part.from_video_file("screen_recording.mp4"),   # Screen recording
    Part.from_text(open("app.py").read()),          # Source code
    "Debug this application and suggest fixes"
])

# Output : Multimodal response
# - Text explanation of bug
# - Code diff with fixes
# - Video tutorial showing how to fix
# - Voice explanation of root cause

print(response.text)  # Textual explanation

# Access other modalities
if response.video:
    response.video.save("fix_tutorial.mp4")

if response.audio:
    response.audio.save("explanation.mp3")

# Future use case: "Design a logo for my company"
# → Input: voice description + mood board images
# → Output: 5 logo variations (SVG + PNG) + usage guidelines PDF

๐Ÿ”ฎ On-Device AI : Privacy First

Tendance 2026-2030 : Modeles puissants executent localement sur devices.

๐Ÿ“ฑ On-Device AI Benefits
  • Privacy : Donnees ne quittent jamais device
  • Latency : Inference instant (<100ms)
  • Offline : Fonctionne sans internet
  • Cost : Pas de frais API
  • Scale : Millions users sans infra backend

๐ŸŒ Tendances 2026-2030

Tendance Timeline Impact
Universal Agent 2027-2028 1 agent remplace 100+ apps specialisees
Generative UI 2026-2027 Developpeurs frontend reduits de 50%
Personal Intelligence 2027-2029 Productivite personnelle +30-50%
Multimodal natif 2026-2027 Text-only devient obsolete
On-device AI 2026-2028 Cloud AI complementaire, pas primaire
AI-first OS 2028-2030 OS traditionnels remplaces par AI OS

Futur IA est multimodal, autonome, personnel, on-device. Universal Agent remplacera apps specialisees. Generative UI eliminera besoin UI designers pour use cases standards. Personal Intelligence deviendra extension naturelle cognition humaine. Preparez architectures pour ce futur : APIs modulaires, data ownership user-centric, privacy by design. 2026 = debut transformation, 2030 = monde different.

Final Project: Complete Enterprise Architecture

⏱ 4-6 hours Project

🎯 Project Goal

Design and document a complete enterprise Gemini architecture for a real use case, integrating all the concepts from Phase 4.

📋 Requirements

🏢 Scenario: E-Commerce Customer Service Platform

Company: TechMart, an e-commerce site with 50M users and 10M transactions/month

Need: an AI-powered customer support platform with autonomous agents

Constraints:

  • Budget: $10,000/month for Vertex AI
  • SLA: 99.9% uptime, <2s latency p95
  • Compliance: GDPR, PCI-DSS
  • Scale: support 100,000 conversations/day
  • Languages: FR, EN, ES, DE
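
Before designing, it helps to translate the scale constraint into request rates. A quick sizing sketch (the 5x peak factor and 5 messages per conversation are assumptions, not part of the brief):

```python
def required_rps(conversations_per_day: int, messages_per_conversation: int,
                 peak_factor: float = 5.0):
    """Average and assumed-peak model requests per second."""
    messages_per_day = conversations_per_day * messages_per_conversation
    avg = messages_per_day / 86_400  # seconds in a day
    return avg, avg * peak_factor

avg, peak = required_rps(100_000, 5)
print(f"avg {avg:.1f} req/s, peak ~{peak:.0f} req/s")
# → avg 5.8 req/s, peak ~29 req/s
```

A peak of roughly 29 req/s is well within Cloud Run autoscaling range, but it sets the load-test target for Phase 4 of the plan below.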

๐Ÿ“ Livrables Requis

1. Architecture Diagram (30 min)

Creer diagram architecture complete incluant :

  • Multi-model routing (Pro/Flash/Flash-8B)
  • RAG avec vector database
  • Agent system avec tools
  • Cache strategy
  • Monitoring & alerting
  • Security layers (VPC-SC, DLP, IAM)
yaml
# architecture.yaml - example structure

components:
  frontend:
    type: Cloud Run
    replicas: 3-10 (autoscaling)
    regions: [us-central1, europe-west1]

  model_router:
    type: Cloud Run
    logic: |
      - Simple queries (FAQ) โ†’ Flash-8B
      - Standard (order status) โ†’ Flash
      - Complex (complaints) โ†’ Pro
    fallback: Pro

  rag_system:
    vector_db: Vertex AI Vector Search
    embeddings: text-embedding-004
    chunk_size: 512 tokens
    top_k: 5

  agent_system:
    framework: Vertex AI Agent Builder
    tools:
      - search_knowledge_base (LOW risk)
      - lookup_order (MEDIUM risk)
      - process_refund (HIGH risk)
      - update_customer_info (MEDIUM risk)
    governance: Tool approval workflow

  caching:
    explicit: System instructions (5000 tokens, TTL 60min)
    implicit: Auto-caching prefixes >1024 tokens

  monitoring:
    metrics:
      - Request count by model
      - Latency p50/p95/p99
      - Cost per conversation
      - Error rate
      - Safety filter triggers
    dashboards: Looker Studio + BigQuery
    alerts:
      - Budget >90% โ†’ Email + PagerDuty
      - Error rate >2% โ†’ PagerDuty
      - Latency p95 >3s โ†’ Slack

  security:
    vpc_sc: Perimeter around Vertex AI
    dlp: Scan prompts for PII before sending
    iam:
      - agents: roles/aiplatform.user
      - developers: roles/aiplatform.admin
    cmek: Customer-managed keys for data
    audit: Data Access logs enabled

2. Implementation Plan (45 min)

Write a detailed implementation plan:

markdown
# Implementation Plan

## Phase 1 : Foundation (Week 1-2)
- [ ] Setup GCP project with VPC-SC
- [ ] Configure IAM roles and service accounts
- [ ] Deploy base infrastructure (Cloud Run, Firestore)
- [ ] Implement model router with Flash-8B/Flash/Pro
- [ ] Setup monitoring dashboard (BigQuery + Looker)

## Phase 2 : RAG System (Week 3-4)
- [ ] Ingest knowledge base (product docs, FAQs)
- [ ] Setup Vertex AI Vector Search
- [ ] Implement chunking strategy (512 tokens)
- [ ] Test retrieval quality (measure precision@5)
- [ ] Optimize embeddings model

## Phase 3 : Agent System (Week 5-6)
- [ ] Register tools in the tool registry
- [ ] Implement tool approval workflow
- [ ] Deploy agents with Vertex AI Agent Builder
- [ ] Configure agent permissions
- [ ] Test agent workflows end-to-end

## Phase 4 : Optimization (Week 7-8)
- [ ] Implement context caching (system instructions)
- [ ] Configure batch processing for analytics
- [ ] Optimize prompts (-30% tokens)
- [ ] Setup cost attribution labels
- [ ] Run load tests (100K requests/day)

## Phase 5 : Security & Compliance (Week 9-10)
- [ ] Enable DLP for PII detection
- [ ] Configure safety settings (BLOCK_LOW_AND_ABOVE)
- [ ] Implement audit logging
- [ ] GDPR compliance review
- [ ] Security penetration testing

## Phase 6 : Production Deploy (Week 11-12)
- [ ] Canary deploy (10% traffic)
- [ ] Monitor metrics for 3 days
- [ ] Rollout to 50%
- [ ] Full production (100%)
- [ ] Post-deploy monitoring 2 weeks

3. Cost Model (45 min)

Compute detailed costs:

python
# cost_model.py

class CostModel:
    def __init__(self):
        # Pricing ($/1M tokens)
        self.prices = {
            "flash-8b": {"input": 0.04, "output": 0.16},
            "flash": {"input": 0.15, "output": 0.60},
            "pro": {"input": 3.00, "output": 12.00},
            "cache": 0.015,
            "embedding": 0.025,
        }

    def calculate_monthly_cost(self,
                                conversations_per_day: int,
                                avg_messages_per_conversation: int):
        """Calculer cout mensuel"""

        total_conversations = conversations_per_day * 30

        # Model distribution (apres routing)
        flash_8b_pct = 0.60  # 60% simple queries
        flash_pct = 0.30     # 30% standard
        pro_pct = 0.10       # 10% complex

        # Tokens par message
        system_instruction_tokens = 5000  # Cached
        user_input_tokens = 300
        rag_context_tokens = 2000
        output_tokens = 150

        # Total messages
        total_messages = total_conversations * avg_messages_per_conversation

        # Cost breakdown
        costs = {}

        # 1. System instruction (cached)
        cache_cost = (system_instruction_tokens * total_messages / 1_000_000) * self.prices["cache"]
        costs["cache"] = cache_cost

        # 2. Embeddings (RAG)
        embedding_cost = (user_input_tokens * total_messages / 1_000_000) * self.prices["embedding"]
        costs["embeddings"] = embedding_cost

        # 3. LLM calls
        for model, pct in [("flash-8b", flash_8b_pct), ("flash", flash_pct), ("pro", pro_pct)]:
            model_messages = total_messages * pct
            input_tokens = user_input_tokens + rag_context_tokens

            input_cost = (input_tokens * model_messages / 1_000_000) * self.prices[model]["input"]
            output_cost = (output_tokens * model_messages / 1_000_000) * self.prices[model]["output"]

            costs[f"{model}_input"] = input_cost
            costs[f"{model}_output"] = output_cost

        total_cost = sum(costs.values())

        return {
            "total_monthly": total_cost,
            "cost_per_conversation": total_cost / total_conversations,
            "breakdown": costs
        }

# Run the calculation for TechMart
model = CostModel()
result = model.calculate_monthly_cost(
    conversations_per_day=100_000,
    avg_messages_per_conversation=5
)

print(f"=== COST MODEL ===")
print(f"Total monthly: ${result['total_monthly']:.2f}")
print(f"Cost per conversation: ${result['cost_per_conversation']:.4f}")
print(f"\nBreakdown:")
for item, cost in result['breakdown'].items():
    print(f"  {item}: ${cost:.2f}")

# Expected output with the parameters above:
# Total monthly: ~$17,290 (Pro input tokens dominate the bill)
# Cost per conversation: ~$0.0058
# → Over the $10,000/month budget: cache the RAG context or shrink the Pro share

4. ADR Documentation (30 min)

Write 3 ADRs for key decisions:

markdown
# ADR-001: Multi-Model Routing Strategy

## Status
ACCEPTED

## Context
TechMart needs to support 100K conversations/day within $10K/month budget.
Using only Pro would cost ~$45K/month. Using only Flash-8B would degrade quality.

## Decision
Implement intelligent model routing:
- Flash-8B (60% traffic): FAQ, simple queries
- Flash (30% traffic): Order status, standard support
- Pro (10% traffic): Complex complaints, escalations

Classifier: Flash-8B with 100-token prompt.

## Consequences
### Positive
- Cost reduced from $45K to $8.5K/month (-81%)
- Quality maintained (85% CSAT vs 87% all-Pro)
- Classifier cost negligible ($50/month)

### Negative
- Added complexity (router service)
- Potential misclassification (~5% rate)

### Mitigation
- Monitor classification accuracy
- Fallback to Pro on errors
- Weekly review misclassified queries

---

# ADR-002: Context Caching for System Instructions

## Status
ACCEPTED

## Context
System instruction contains 5000 tokens (product catalog, policies, FAQs).
Without caching: $0.00075 per message ร— 15M messages = $11,250/month just for system instruction.

## Decision
Enable explicit context caching with 60-min TTL.
Pre-warm cache every 55 minutes to avoid cold starts.

## Consequences
### Positive
- Cache cost: $1,125/month (vs $11,250 without)
- Savings: $10,125/month (-90%)
- No latency impact

### Negative
- Cache management complexity
- Risk of stale cache if system instruction changes

### Mitigation
- Cache invalidation on system instruction update
- Monitoring cache hit rate (target >95%)

---

# ADR-003: DLP for PII Protection

## Status
ACCEPTED

## Context
GDPR requires protecting customer PII.
Risk: Customers may share SSN, credit cards in chat.

## Decision
Implement Cloud DLP to scan all user messages before sending to Gemini.
Redact: SSN, credit cards, emails, phone numbers.

## Consequences
### Positive
- GDPR compliance
- Protect customer privacy
- Prevent PII leakage to LLM

### Negative
- Added latency: +50-100ms per message
- Cost: $0.000015 per message = $450/month

### Mitigation
- Async DLP (non-blocking for non-PII messages)
- Cache DLP results for repeated messages
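
The redact-before-send flow in ADR-003 can be illustrated with plain regexes (a simplified stand-in only; Cloud DLP's infoType detectors are far more robust and are what production should use):

```python
import re

# Simplified patterns; Cloud DLP covers many more infoTypes and formats
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "CREDIT_CARD": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "PHONE": re.compile(r"\+?\d[\d .-]{8,}\d"),
}

def redact_pii(text: str) -> str:
    """Replace detected PII spans with their type label before LLM calls."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

msg = "Contact me at jane@example.com about card 4111 1111 1111 1111"
print(redact_pii(msg))
# → Contact me at [EMAIL] about card [CREDIT_CARD]
```

The key architectural point survives the simplification: redaction happens in the request path, before any token reaches the model.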

5. Security Checklist (20 min)

Security Control Implementation Status
VPC Service Controls Perimeter around Vertex AI, block data exfiltration โœ… Required
DLP PII Scanning Scan all prompts, redact SSN/CC/email โœ… Required
IAM Least Privilege Service accounts with minimal roles โœ… Required
Audit Logging Data Access logs for all Vertex AI calls โœ… Required
Safety Settings BLOCK_LOW_AND_ABOVE for all categories โœ… Required
CMEK Customer-managed encryption keys โš ๏ธ Optional (highly recommended)
Private Service Connect Vertex AI access via private endpoint ⚠️ Optional (for locked-down networks)
Secrets Management API keys in Secret Manager โœ… Required

6. Monitoring Dashboard (20 min)

Define metrics and alerts:

```yaml
# monitoring.yaml

dashboards:
  overview:
    metrics:
      - Total conversations (24h)
      - Active conversations (realtime)
      - Avg response time
      - Cost today vs budget
      - Error rate

  performance:
    metrics:
      - Latency p50/p95/p99 by model
      - Cache hit rate
      - RAG retrieval quality (precision@5)
      - Agent tool call success rate

  cost:
    metrics:
      - Cost by model (pie chart)
      - Daily cost trend (30 days)
      - Cost per conversation
      - Budget utilization (%)

  quality:
    metrics:
      - CSAT score
      - Resolution rate
      - Escalation rate
      - Safety filter blocks

alerts:
  - name: Budget Alert
    condition: daily_cost > $400
    channels: [email, slack]
    severity: warning

  - name: Error Rate High
    condition: error_rate > 2%
    channels: [pagerduty]
    severity: critical

  - name: Latency Degradation
    condition: p95_latency > 3000ms
    channels: [slack]
    severity: warning

  - name: Cache Hit Rate Low
    condition: cache_hit_rate < 90%
    channels: [email]
    severity: info

  - name: Safety Filter Spike
    condition: safety_blocks > 100/hour
    channels: [email, slack]
    severity: warning
```
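As a sketch of how a few of these threshold rules could be evaluated in application code, here is a hypothetical helper (Cloud Monitoring evaluates real alerting policies server-side; the `Alert` type and `evaluate` function below are illustrative, with units normalized to $, %, and ms):

```python
from dataclasses import dataclass

@dataclass
class Alert:
    name: str
    metric: str
    op: str          # ">" or "<"
    threshold: float
    severity: str

# Rules mirroring monitoring.yaml (units normalized: $, %, ms).
ALERTS = [
    Alert("Budget Alert", "daily_cost", ">", 400, "warning"),
    Alert("Error Rate High", "error_rate", ">", 2, "critical"),
    Alert("Latency Degradation", "p95_latency", ">", 3000, "warning"),
    Alert("Cache Hit Rate Low", "cache_hit_rate", "<", 90, "info"),
]

def evaluate(metrics: dict[str, float]) -> list[str]:
    """Return the names of alerts whose condition currently holds."""
    fired = []
    for a in ALERTS:
        value = metrics.get(a.metric)
        if value is None:
            continue
        if (a.op == ">" and value > a.threshold) or (a.op == "<" and value < a.threshold):
            fired.append(a.name)
    return fired

evaluate({"daily_cost": 420, "error_rate": 0.5,
          "p95_latency": 2500, "cache_hit_rate": 85})
# → ["Budget Alert", "Cache Hit Rate Low"]
```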

📊 Evaluation Criteria

| Criterion | Points | Description |
|---|---|---|
| Architecture | 25 | Completeness, coherence, scalability |
| Cost Optimization | 20 | Model routing, caching, realistic cost model |
| Security | 20 | VPC-SC, DLP, IAM, compliance |
| Implementation Plan | 15 | Realism, timeline, dependencies |
| Monitoring | 10 | Relevant metrics, actionable alerts |
| Documentation | 10 | ADRs, diagrams, clarity |

Total: 100 points

Passing threshold: 70/100

This final project synthesizes all of Phase 4. A solid architecture is the foundation of production success. Take the time to design carefully before implementing, validate your assumptions with cost calculations, and document your decisions (ADRs). In an enterprise setting, this kind of design doc is a prerequisite before a dev sprint: the quality of the architecture determines the long-term success of an AI project.
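As an example of the kind of back-of-the-envelope cost check this implies, the sketch below estimates a blended monthly cost under a model-routing split. All prices, token counts, and volumes are illustrative placeholders, not real price-list values:

```python
# Illustrative per-1M-token prices (placeholders, not a real price list).
PRICE_PER_1M = {"flash": {"input": 0.10, "output": 0.40},
                "pro":   {"input": 1.25, "output": 5.00}}

def monthly_cost(model: str, convs_per_day: int,
                 in_tok: int, out_tok: int, days: int = 30) -> float:
    """Monthly cost for one model, given avg tokens per conversation."""
    p = PRICE_PER_1M[model]
    per_conv = (in_tok * p["input"] + out_tok * p["output"]) / 1_000_000
    return convs_per_day * days * per_conv

# Assumed traffic: 10K conversations/day, 2K input + 500 output tokens each,
# routed 80% to "flash" and 20% to "pro".
blended = (0.8 * monthly_cost("flash", 10_000, 2_000, 500)
           + 0.2 * monthly_cost("pro", 10_000, 2_000, 500))
# → 396.0 ($/month under these assumptions)
```

Changing the routing split in this model immediately shows the savings lever: routing everything to "pro" would cost $1,500/month under the same assumptions.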

Final Exam & Certification

⏱ 60 min Certification

🎯 Objective

Validate complete mastery of Phase 4: Enterprise Deployment, FinOps, Governance.

Format: 30 multiple-choice questions + final project validation

Duration: 60 minutes

Passing score: 24/30 (80%)

📝 Final Exam: 30 Questions

1. What is the main difference between AI Studio and Vertex AI?

Different pricing
Different models
Vertex AI offers VPC-SC, enterprise IAM, and an SLA
No difference

2. Which solution should you use to deploy a serverless Gemini API?

Compute Engine
Cloud Run
GKE Autopilot
App Engine

3. How do you eliminate Cloud Run cold starts?

min-instances > 0
max-instances > 100
Increase memory
Impossible

4. VPC Service Controls lets you:

Reduce costs
Speed up inference
Prevent data exfiltration
Enable caching

5. Where should API keys be stored securely?

In the Git source code
Secret Manager
A .env file
Cloud Run environment variables

6. DLP (Data Loss Prevention) is used to:

Detect and redact PII before it reaches Gemini
Back up data
Monitor costs
Speed up requests

7. Gemini tiered pricing means:

The price increases with volume
The price varies by region
An enterprise discount
The price is halved beyond 200K tokens

8. By how much does context caching reduce costs?

-50%
-75%
-90% on the cached portion
-99%

9. Intelligent model routing can save:

60-70%
20-30%
10-15%
5-10%

10. The Batch API offers a cost reduction of:

-30%
-50%
-70%
-90%

11. After how many requests does context caching pay off?

2 requests (immediate ROI)
10 requests
100 requests
1000 requests

12. How much does Flash-8B cost compared to Pro?

2x cheaper
10x cheaper
75x cheaper
Same price

13. The BigQuery billing export is:

Paid ($100/month)
Free and essential for FinOps
Optional
Enterprise only

14. Budget alerts are recommended at:

100%
80% and 100%
50%, 90%, and 100%
Not necessary

15. How much do output tokens cost compared to input tokens (Flash)?

Same price
4x more
2x more
10x more

16. The 7 Google AI Principles include:

Be socially beneficial, Avoid bias, Safety, Privacy
Profit, Growth, Innovation
Speed, Cost, Quality
Open source, Free, Fast

17. The safety setting BLOCK_LOW_AND_ABOVE means:

Block only high-probability content
No blocking
Block low, medium, and high (strict)
Block only low

18. Gemma Scope enables:

Faster inference
Model interpretability via SAEs
Lower costs
Automatic fine-tuning

19. The model lifecycle stages are:

Dev → Staging → Production → Deprecation
Planning → Build → Deploy
Alpha → Beta → GA
Train → Test → Validate

20. An ADR (Architecture Decision Record) documents:

Source code
API endpoints
Architecture decisions with their context and consequences
User stories

21. Governance for HIGH-risk tools requires:

No approval
Manager approval + an audit trail
Auto-approve
A complete block

22. An agent audit trail must log:

All tool calls with their params/results
Errors only
Nothing (privacy)
Costs only

23. The main difference between Gemma 3 and Gemini:

Gemma is faster
Gemma's API is cheaper
Gemma is open-source and self-hosted
No difference

24. Gemma Nano's main use case:

Cloud servers
On-device inference (smartphones)
Batch processing
Fine-tuning

25. NotebookLM lets you:

Turn documents into an interactive AI assistant
Run Python code
Host models
Manage budgets

26. Astra DB is optimized for:

Analytics
Transactions
Vector search for RAG with Gemini
Data warehousing

27. The Universal Agent vision for 2027+:

One agent per use case
One agent for all digital tasks
No agents, only APIs
Hardware agents only

28. The Generative UI paradigm shift:

UI generated dynamically by AI as needed
A static, predefined UI
No UI
A 3D virtual-reality UI

29. The main advantage of on-device AI:

Faster than cloud
Cheaper than cloud
Total privacy + offline
Superior quality

30. Canary deployment means:

Deploying to all servers simultaneously
Deploying to 10% of traffic, monitoring, then ramping up gradually
Deploying only at night
Deploying with automatic rollback

🎓 Getting Certified

📜 Gemini Architect Certification

Validation criteria:

  • ✅ Final exam: minimum 24/30 (80%)
  • ✅ Final project: minimum 70/100
  • ✅ All Phase 4 lessons completed

Certification includes:

  • PDF certificate with QR-code verification
  • "Gemini Enterprise Architect" LinkedIn badge
  • Access to the certified-architects community

🚀 Next Steps

Congratulations on completing Phase 4!

You now master:

  • ✅ Production-ready enterprise deployment
  • ✅ FinOps & cost optimization (60-80% savings)
  • ✅ Model and agent governance
  • ✅ Responsible AI and compliance
  • ✅ The full Google AI ecosystem

Keep learning:

  1. Implement a real project: apply this architecture to an enterprise use case
  2. Contribute to open source: Gemma, Gemma Scope, Vertex AI samples
  3. Join the community: Google AI Discord, GCP forums
  4. Follow the news: the Google AI Blog, Vertex AI release notes
  5. Complementary certifications: GCP Professional Cloud Architect

Resources:

  • 📚 Documentation: cloud.google.com/vertex-ai/docs
  • 💬 Community: discord.gg/google-ai
  • 🎥 Videos: YouTube @GoogleCloudTech
  • 📰 Blog: cloud.google.com/blog/products/ai-machine-learning

You are now a certified Gemini Architect. Go build amazing AI applications!