Phase 4 - Production & Enterprise | Gemini Architect Academy

Google AI Studio vs Vertex AI

⏱ 20 min Intermediaire

🎯 Objectifs d'apprentissage

Comprendre les differences entre Google AI Studio et Vertex AI
Savoir choisir la bonne plateforme selon le cas d'usage
Planifier une migration de AI Studio vers Vertex AI
Maitriser les criteres de decision pour l'entreprise

📊 Comparaison Complete

Critere	Google AI Studio	Vertex AI
Public cible	Developpeurs, prototypage rapide	Entreprises, production
Acces	Compte Google gratuit	Projet GCP avec facturation
API Key	Simple API key (generativelanguage.googleapis.com)	Service Account, ADC (PROJECT-aiplatform.googleapis.com)
Quotas	Limites par defaut (60 req/min)	Quotas personnalisables, augmentables
Securite	API key partageable	IAM, VPC-SC, CMEK, Private Service Connect
Conformite	Aucune garantie	SOC2, ISO 27001, HIPAA, GDPR
Data residency	Multi-region (EU ou US)	Region specifique choisie
Monitoring	Basique dans console	Cloud Monitoring, logging, tracing, SLIs
Caching	Context Caching disponible	Context Caching + optimisations enterprise
Prix	Meme tarif que Vertex AI	Meme tarif + options enterprise
SLA	Aucun	99.9% uptime (GA models)

Google AI Studio est parfait pour prototyper rapidement, tester des prompts, et creer des demos. Mais des que vous passez en production avec des donnees sensibles ou des exigences de conformite, Vertex AI devient indispensable.

🏗 Architecture de Decision

┌─────────────────────────────────────────────────────────┐
│                 DECISION : AI Studio vs Vertex AI        │
└─────────────────────────────────────────────────────────┘
                            │
                            ▼
                  ┌─────────────────────┐
                  │  Cas d'usage ?      │
                  └─────────────────────┘
                            │
         ┌──────────────────┼──────────────────┐
         ▼                  ▼                  ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ PROTOTYPAGE     │ │ PRODUCTION      │ │ ENTREPRISE      │
│                 │ │ SIMPLE          │ │ CRITIQUE        │
└─────────────────┘ └─────────────────┘ └─────────────────┘
         │                  │                  │
         ▼                  ▼                  ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ AI STUDIO ✓     │ │ Vertex AI       │ │ Vertex AI ✓     │
│                 │ │ (recommande)    │ │ + VPC-SC        │
│ - Rapide        │ │                 │ │ + CMEK          │
│ - Gratuit       │ │ - Monitoring    │ │ - Compliance    │
│ - Experimentation│ │ - Quotas        │ │ - Data residency│
└─────────────────┘ └─────────────────┘ └─────────────────┘

🔄 Migration AI Studio → Vertex AI

Etape 1 : Creer un projet GCP

bash

# 1. Creer projet GCP
gcloud projects create mon-projet-gemini --name="Gemini Production"

# 2. Activer APIs
gcloud services enable aiplatform.googleapis.com
gcloud services enable cloudresourcemanager.googleapis.com

# 3. Creer Service Account
gcloud iam service-accounts create gemini-sa \
  --display-name="Gemini Service Account"

# 4. Donner permissions
gcloud projects add-iam-policy-binding mon-projet-gemini \
  --member="serviceAccount:gemini-sa@mon-projet-gemini.iam.gserviceaccount.com" \
  --role="roles/aiplatform.user"

Etape 2 : Adapter le code

python

# AVANT : AI Studio
import google.generativeai as genai

genai.configure(api_key="AIzaSy...")
model = genai.GenerativeModel('gemini-2.0-flash-exp')
response = model.generate_content("Hello")

# APRES : Vertex AI
from vertexai.generative_models import GenerativeModel
import vertexai

vertexai.init(project="mon-projet-gemini", location="us-central1")
model = GenerativeModel('gemini-2.0-flash-exp')
response = model.generate_content("Hello")

💡 Migration sans friction

L'API Vertex AI est quasi-identique a celle de AI Studio. Seule l'initialisation change. Les prompts, parametres et reponses restent les memes.

⚖️ Criteres de Decision Enterprise

Utilisez Vertex AI si :

✅ Vous traitez des donnees clients sensibles (PII, PHI)
✅ Vous avez besoin de conformite (GDPR, HIPAA, SOC2)
✅ Vous voulez controler la region de traitement des donnees
✅ Vous avez besoin de quotas eleves (>60 req/min)
✅ Vous voulez un SLA avec uptime 99.9%
✅ Vous devez integrer avec VPC, Private Service Connect
✅ Vous avez besoin d'audit logs detailles
✅ Vous voulez du monitoring avance (Cloud Monitoring)

Utilisez AI Studio si :

✅ Vous etes en phase de prototypage/experimentation
✅ Vous n'avez pas de donnees sensibles
✅ Vous voulez tester rapidement sans setup GCP
✅ Vous explorez les capacites de Gemini
✅ Vous creez une demo ou un hackathon

⚠️ Attention aux API Keys

Les API keys de AI Studio ne doivent JAMAIS etre commitees dans Git ni exposees cote client. Pour la production, utilisez toujours Vertex AI avec Service Accounts et Application Default Credentials (ADC).

Vertex AI : Setup Enterprise

⏱ 25 min Avance

🎯 Objectifs d'apprentissage

Configurer un projet Vertex AI production-ready
Maitriser IAM, VPC-SC, CMEK pour la securite
Configurer Private Service Connect pour l'isolation
Gerer les quotas et limites

🏗 Architecture Vertex AI Enterprise

┌────────────────────────────────────────────────────────────┐
│                    VERTEX AI ENTERPRISE                     │
└────────────────────────────────────────────────────────────┘
                            │
         ┌──────────────────┼──────────────────┐
         ▼                  ▼                  ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│   SECURITE      │ │  NETWORKING     │ │  MONITORING     │
└─────────────────┘ └─────────────────┘ └─────────────────┘
│                   │                   │
│ - IAM Policies    │ - VPC-SC          │ - Cloud Logging
│ - Service Account │ - Private Connect │ - Cloud Monitoring
│ - CMEK            │ - Shared VPC      │ - Audit Logs
│ - Workload ID     │ - Cloud NAT       │ - Cost Dashboard
└───────────────────┴───────────────────┴───────────────────┘

🔐 Configuration IAM

Roles principaux :

Role	Permissions	Cas d'usage
`roles/aiplatform.user`	Utiliser Gemini, lire modeles	Applications backend, services
`roles/aiplatform.admin`	Gerer endpoints, datasets	Admins ML, DevOps
`roles/aiplatform.viewer`	Lire ressources (read-only)	Monitoring, audit
`roles/serviceusage.serviceUsageConsumer`	Consommer APIs	Toutes applications

bash

# Configuration IAM complete
PROJECT_ID="mon-projet-prod"
SA_NAME="gemini-backend-sa"
SA_EMAIL="${SA_NAME}@${PROJECT_ID}.iam.gserviceaccount.com"

# 1. Creer Service Account
gcloud iam service-accounts create $SA_NAME \
  --display-name="Gemini Backend Service" \
  --description="SA for production Gemini API calls"

# 2. Donner permissions minimales (Principle of Least Privilege)
gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member="serviceAccount:$SA_EMAIL" \
  --role="roles/aiplatform.user"

gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member="serviceAccount:$SA_EMAIL" \
  --role="roles/serviceusage.serviceUsageConsumer"

# 3. (Optionnel) Workload Identity pour GKE
gcloud iam service-accounts add-iam-policy-binding $SA_EMAIL \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:${PROJECT_ID}.svc.id.goog[NAMESPACE/KSA_NAME]"

# 4. Generer key (si besoin, prefer ADC)
gcloud iam service-accounts keys create key.json \
  --iam-account=$SA_EMAIL

🛡 VPC Service Controls (VPC-SC)

VPC-SC cree un perimetre de securite pour proteger les donnees contre l'exfiltration.

bash

# 1. Creer Access Policy
gcloud access-context-manager policies create \
  --organization ORG_ID \
  --title "Gemini Production Policy"

# 2. Creer Service Perimeter
gcloud access-context-manager perimeters create gemini_perimeter \
  --title="Gemini Secure Perimeter" \
  --resources=projects/PROJECT_NUMBER \
  --restricted-services=aiplatform.googleapis.com \
  --policy=POLICY_ID

# 3. Autoriser Private Service Connect
gcloud access-context-manager perimeters update gemini_perimeter \
  --add-vpc-allowed-services=aiplatform.googleapis.com \
  --policy=POLICY_ID

💡 VPC-SC en pratique

VPC-SC empeche les appels Vertex AI depuis l'exterieur du perimetre. Ideal pour les donnees HIPAA ou financieres. Mais attention : cela peut bloquer vos developpeurs locaux. Utilisez Access Levels pour autoriser certains IPs.

🔒 CMEK (Customer-Managed Encryption Keys)

Par defaut, Google chiffre toutes les donnees. CMEK vous donne le controle des cles de chiffrement.

bash

# 1. Creer Key Ring et Key dans Cloud KMS
gcloud kms keyrings create gemini-keyring \
  --location=us-central1

gcloud kms keys create gemini-key \
  --location=us-central1 \
  --keyring=gemini-keyring \
  --purpose=encryption

# 2. Donner acces a Vertex AI Service Account
PROJECT_NUMBER=$(gcloud projects describe $PROJECT_ID --format="value(projectNumber)")
SA_VERTEX="service-${PROJECT_NUMBER}@gcp-sa-aiplatform.iam.gserviceaccount.com"

gcloud kms keys add-iam-policy-binding gemini-key \
  --location=us-central1 \
  --keyring=gemini-keyring \
  --member="serviceAccount:$SA_VERTEX" \
  --role="roles/cloudkms.cryptoKeyEncrypterDecrypter"

# 3. Utiliser CMEK dans Vertex AI (via console ou API)
# Lors de la creation d'un endpoint ou dataset, specifier :
# encryption_spec_key_name = "projects/PROJECT/locations/LOCATION/keyRings/RING/cryptoKeys/KEY"

🌐 Private Service Connect

Private Service Connect permet d'appeler Vertex AI depuis votre VPC sans passer par Internet.

┌──────────────────────────────────────────────────────────┐
│                      VPC (10.0.0.0/16)                    │
│  ┌─────────────────┐         ┌─────────────────┐         │
│  │  GKE Cluster    │────────▶│ Private Service │         │
│  │  (10.0.1.0/24)  │         │ Connect Endpoint│         │
│  └─────────────────┘         └─────────────────┘         │
│                                       │                   │
└───────────────────────────────────────┼───────────────────┘
                                        │ (Traffic prive)
                                        ▼
                        ┌───────────────────────────┐
                        │   Vertex AI Service       │
                        │ (aiplatform.googleapis.com)│
                        └───────────────────────────┘

bash

# Configuration Private Service Connect pour Vertex AI
gcloud compute addresses create vertex-ai-psc \
  --region=us-central1 \
  --subnet=default \
  --addresses=10.0.2.10

gcloud compute forwarding-rules create vertex-ai-psc-rule \
  --region=us-central1 \
  --network=default \
  --address=vertex-ai-psc \
  --target-service-attachment=projects/PROJECT/regions/us-central1/serviceAttachments/aiplatform

📊 Gestion des Quotas

Quotas par defaut Vertex AI :

Gemini Pro : 60 req/min, 4000 req/jour
Gemini Flash : 1000 req/min, 10000 req/jour
Gemini Flash-Lite : 1500 req/min, 15000 req/jour
Tokens max : 2M tokens/min (input + output combines)

bash

# Verifier quotas actuels
gcloud services quota list \
  --service=aiplatform.googleapis.com \
  --consumer=projects/$PROJECT_ID

# Demander augmentation de quota (via console ou support)
# Console GCP > IAM & Admin > Quotas > Filtrer "aiplatform" > Modifier

Pour la production, anticipez vos besoins de quotas. Une demande d'augmentation peut prendre 2-3 jours ouvrables. Mettez en place du rate limiting cote application et des fallbacks pour gerer les depassements de quota gracieusement.

Infrastructure & Scalabilite

⏱ 30 min Avance

🎯 Objectifs d'apprentissage

Deployer Gemini sur Cloud Run, GKE, Cloud Functions
Configurer l'auto-scaling et load balancing
Optimiser la latence avec CDN et caching
Concevoir une architecture hautement disponible

☁️ Options de Deploiement

Solution	Cas d'usage	Avantages	Limites
Cloud Run	APIs serverless, microservices	Auto-scaling, pay-per-use, simple	Cold starts (~1-2s)
GKE (Kubernetes)	Workloads complexes, controle total	Flexible, multi-cloud, scaling precis	Complexite, overhead
Cloud Functions	Event-driven, webhooks simples	Tres simple, integrations natives	Timeout 60min, cold starts
Compute Engine	Legacy apps, controle VM	Controle total, compatible legacy	Pas d'auto-scaling automatique

🚀 Deploiement Cloud Run (Recommande)

python

# main.py - Service Gemini sur Cloud Run
from flask import Flask, request, jsonify
from vertexai.generative_models import GenerativeModel
import vertexai
import os

app = Flask(__name__)

# Init Vertex AI (utilise ADC automatiquement sur Cloud Run)
vertexai.init(
    project=os.environ.get("GCP_PROJECT"),
    location=os.environ.get("GCP_REGION", "us-central1")
)

model = GenerativeModel("gemini-2.0-flash-exp")

@app.route("/generate", methods=["POST"])
def generate():
    try:
        data = request.get_json()
        prompt = data.get("prompt")

        # Generation avec retry automatique
        response = model.generate_content(
            prompt,
            generation_config={
                "temperature": 0.7,
                "max_output_tokens": 2048
            }
        )

        return jsonify({
            "text": response.text,
            "usage": {
                "prompt_tokens": response.usage_metadata.prompt_token_count,
                "candidates_tokens": response.usage_metadata.candidates_token_count
            }
        })
    except Exception as e:
        return jsonify({"error": str(e)}), 500

@app.route("/health", methods=["GET"])
def health():
    return jsonify({"status": "healthy"}), 200

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=int(os.environ.get("PORT", 8080)))

dockerfile

# Dockerfile
FROM python:3.12-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY main.py .

# Healthcheck pour Cloud Run
HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
  CMD python -c "import requests; requests.get('http://localhost:8080/health')"

CMD ["gunicorn", "--bind", "0.0.0.0:8080", "--workers", "4", "--threads", "2", "--timeout", "300", "main:app"]

bash

# Deploiement Cloud Run avec optimisations
gcloud run deploy gemini-api \
  --source . \
  --region us-central1 \
  --platform managed \
  --allow-unauthenticated \
  --service-account gemini-backend-sa@PROJECT_ID.iam.gserviceaccount.com \
  --set-env-vars GCP_PROJECT=PROJECT_ID,GCP_REGION=us-central1 \
  --memory 2Gi \
  --cpu 2 \
  --min-instances 1 \
  --max-instances 100 \
  --concurrency 80 \
  --timeout 300 \
  --cpu-boost \
  --execution-environment gen2

# Configuration auto-scaling
gcloud run services update gemini-api \
  --region us-central1 \
  --cpu-throttling \
  --max-instances 100 \
  --min-instances 2

💡 Optimisation Cloud Run

--min-instances 2 : Elimine cold starts pour 99% des requetes
--cpu-boost : Accelere le demarrage des instances (~30% plus rapide)
--concurrency 80 : Equilibre entre throughput et latence
--execution-environment gen2 : 2x plus rapide, meilleure isolation

⚓ Deploiement GKE (Kubernetes)

yaml

# k8s-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gemini-api
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: gemini-api
  template:
    metadata:
      labels:
        app: gemini-api
    spec:
      serviceAccountName: gemini-k8s-sa
      containers:
      - name: gemini-api
        image: gcr.io/PROJECT_ID/gemini-api:v1.0.0
        ports:
        - containerPort: 8080
        env:
        - name: GCP_PROJECT
          value: "PROJECT_ID"
        - name: GCP_REGION
          value: "us-central1"
        resources:
          requests:
            memory: "1Gi"
            cpu: "1000m"
          limits:
            memory: "2Gi"
            cpu: "2000m"
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 30
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 10
---
apiVersion: v1
kind: Service
metadata:
  name: gemini-api-service
  namespace: production
spec:
  type: LoadBalancer
  selector:
    app: gemini-api
  ports:
  - protocol: TCP
    port: 80
    targetPort: 8080
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: gemini-api-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: gemini-api
  minReplicas: 3
  maxReplicas: 50
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80

🌍 Load Balancing & CDN

┌───────────────────────────────────────────────────────────┐
│                    GLOBAL LOAD BALANCER                    │
│                  (Cloud Load Balancing)                    │
└───────────────────────────────────────────────────────────┘
                            │
         ┌──────────────────┼──────────────────┐
         ▼                  ▼                  ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│  us-central1    │ │   europe-west1  │ │  asia-east1     │
│  Cloud Run      │ │   Cloud Run     │ │  Cloud Run      │
│  (3 instances)  │ │   (3 instances) │ │  (3 instances)  │
└─────────────────┘ └─────────────────┘ └─────────────────┘
         │                  │                  │
         └──────────────────┼──────────────────┘
                            ▼
                    ┌───────────────┐
                    │  Vertex AI    │
                    │ (us-central1) │
                    └───────────────┘

bash

# Configuration Load Balancer global avec Cloud CDN
# 1. Creer NEG (Network Endpoint Group) pour Cloud Run
gcloud compute network-endpoint-groups create gemini-api-neg \
  --region=us-central1 \
  --network-endpoint-type=serverless \
  --cloud-run-service=gemini-api

# 2. Creer Backend Service avec CDN
gcloud compute backend-services create gemini-backend \
  --global \
  --enable-cdn \
  --cache-mode=CACHE_ALL_STATIC \
  --default-ttl=3600

# 3. Ajouter NEG au backend
gcloud compute backend-services add-backend gemini-backend \
  --global \
  --network-endpoint-group=gemini-api-neg \
  --network-endpoint-group-region=us-central1

# 4. Creer URL Map et proxy HTTPS
gcloud compute url-maps create gemini-lb \
  --default-service=gemini-backend

gcloud compute target-https-proxies create gemini-https-proxy \
  --url-map=gemini-lb \
  --ssl-certificates=gemini-cert

# 5. Creer IP globale et forwarding rule
gcloud compute addresses create gemini-ip --global

gcloud compute forwarding-rules create gemini-https-rule \
  --global \
  --target-https-proxy=gemini-https-proxy \
  --address=gemini-ip \
  --ports=443

Pour une latence optimale : deployez Cloud Run dans plusieurs regions (us-central1, europe-west1, asia-east1), configurez un Load Balancer global, et activez Cloud CDN pour cacher les reponses frequentes. Vertex AI n'est disponible que dans certaines regions, donc vos backends devront appeler la region Vertex AI la plus proche.

⚡ Optimisations de Performance

Technique	Impact latence	Implementation
Min instances > 0	-1000ms (elimine cold start)	`--min-instances 2` sur Cloud Run
Connection pooling	-50ms par requete	Reutiliser client Vertex AI
Streaming	-2000ms (TTFT)	`stream=True` dans generate_content
Context caching	-80% latence	Cacher prompts systeme longs
CDN pour assets	-200ms (assets statiques)	Cloud CDN sur Load Balancer
Regions multiples	-100ms (latence geo)	Deploy multi-region + GLB

Securite Enterprise

⏱ 25 min Avance

🎯 Objectifs d'apprentissage

Implementer une strategie IAM zero-trust
Configurer VPC Service Controls et DLP
Securiser les secrets avec Secret Manager
Activer audit logs et monitoring de securite

🛡 Defense en Profondeur (Defense in Depth)

┌────────────────────────────────────────────────────────────┐
│                    COUCHE 7 : MONITORING                    │
│         Audit Logs, Security Command Center, Alerting       │
└────────────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────────────────┐
│                    COUCHE 6 : DLP & FILTERING               │
│              Data Loss Prevention, Content Moderation       │
└────────────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────────────────┐
│                    COUCHE 5 : ENCRYPTION                    │
│                 CMEK, TLS 1.3, Data at Rest                 │
└────────────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────────────────┐
│                    COUCHE 4 : SECRETS                       │
│               Secret Manager, Workload Identity             │
└────────────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────────────────┐
│                    COUCHE 3 : NETWORK ISOLATION             │
│              VPC-SC, Private Service Connect                │
└────────────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────────────────┐
│                    COUCHE 2 : IDENTITY                      │
│               IAM Policies, Service Accounts                │
└────────────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────────────────┐
│                    COUCHE 1 : AUTHENTICATION                │
│              OAuth 2.0, API Keys, mTLS                      │
└────────────────────────────────────────────────────────────┘

🔐 IAM Zero-Trust

Principe de moindre privilege (Least Privilege) :

bash

# MAUVAIS : Donner roles/owner (trop de permissions)
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="serviceAccount:app@PROJECT_ID.iam.gserviceaccount.com" \
  --role="roles/owner"  # ❌ DANGER

# BON : Permissions granulaires minimales
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="serviceAccount:app@PROJECT_ID.iam.gserviceaccount.com" \
  --role="roles/aiplatform.user"  # ✅ Minimum necessaire

# Encore mieux : Custom Role avec permissions precises
gcloud iam roles create geminiUserCustom \
  --project=PROJECT_ID \
  --title="Gemini User Custom" \
  --permissions=aiplatform.endpoints.predict,aiplatform.models.get

Segregation par environnement :

bash

# Service Accounts separes par environnement
# DEV
gcloud iam service-accounts create gemini-dev-sa \
  --display-name="Gemini Dev" \
  --project=project-dev

# STAGING
gcloud iam service-accounts create gemini-staging-sa \
  --display-name="Gemini Staging" \
  --project=project-staging

# PROD
gcloud iam service-accounts create gemini-prod-sa \
  --display-name="Gemini Prod" \
  --project=project-prod

# IAM Conditions : Limiter acces par IP, heure, ressource
gcloud projects add-iam-policy-binding project-prod \
  --member="serviceAccount:gemini-prod-sa@project-prod.iam.gserviceaccount.com" \
  --role="roles/aiplatform.user" \
  --condition='expression=request.time < timestamp("2026-12-31T23:59:59Z"),title=expires-end-of-year'

🔒 Secret Manager

bash

# 1. Creer secret pour API keys tierces
echo -n "sk-openai-api-key-xyz" | gcloud secrets create openai-api-key \
  --data-file=- \
  --replication-policy="automatic"

# 2. Donner acces au Service Account
gcloud secrets add-iam-policy-binding openai-api-key \
  --member="serviceAccount:gemini-prod-sa@project-prod.iam.gserviceaccount.com" \
  --role="roles/secretmanager.secretAccessor"

# 3. Utiliser dans l'application
from google.cloud import secretmanager

client = secretmanager.SecretManagerServiceClient()
name = "projects/PROJECT_ID/secrets/openai-api-key/versions/latest"
response = client.access_secret_version(request={"name": name})
api_key = response.payload.data.decode("UTF-8")

⚠️ Secrets : Ce qu'il ne faut JAMAIS faire

❌ Hardcoder des secrets dans le code source
❌ Commiter des .env avec vraies cles dans Git
❌ Exposer secrets dans logs ou error messages
❌ Partager secrets via Slack/Email
❌ Utiliser la meme API key pour dev/staging/prod

🚨 Data Loss Prevention (DLP)

DLP detecte et masque automatiquement les donnees sensibles (PII, PHI, PCI) avant envoi a Gemini.

python

# DLP Inspection avant envoi a Gemini
from google.cloud import dlp_v2

def inspect_and_deidentify(text, project_id):
    dlp = dlp_v2.DlpServiceClient()
    parent = f"projects/{project_id}/locations/global"

    # Configuration inspection (detecter PII)
    inspect_config = {
        "info_types": [
            {"name": "EMAIL_ADDRESS"},
            {"name": "PHONE_NUMBER"},
            {"name": "CREDIT_CARD_NUMBER"},
            {"name": "US_SOCIAL_SECURITY_NUMBER"},
            {"name": "PERSON_NAME"}
        ],
        "min_likelihood": dlp_v2.Likelihood.LIKELY
    }

    # Configuration de-identification (masquer PII)
    deidentify_config = {
        "info_type_transformations": {
            "transformations": [{
                "primitive_transformation": {
                    "replace_with_info_type_config": {}
                }
            }]
        }
    }

    item = {"value": text}
    response = dlp.deidentify_content(
        request={
            "parent": parent,
            "deidentify_config": deidentify_config,
            "inspect_config": inspect_config,
            "item": item
        }
    )

    return response.item.value

# Utilisation
user_input = "Mon email est john.doe@example.com et mon tel est 555-1234"
safe_input = inspect_and_deidentify(user_input, "mon-projet")
# Resultat : "Mon email est [EMAIL_ADDRESS] et mon tel est [PHONE_NUMBER]"

# Envoyer a Gemini seulement le texte de-identifie
response = model.generate_content(safe_input)

📝 Audit Logs

3 types d'Audit Logs :

Admin Activity : Actions d'administration (toujours active, gratuit)
Data Access : Lectures/ecritures de donnees (doit etre active, payant)
System Event : Evenements GCP internes (gratuit)

bash

# Activer Data Access Logs pour Vertex AI
gcloud logging project-logs enable \
  DATA_ACCESS \
  --project=PROJECT_ID

# Requete logs : Qui a appele Gemini ?
gcloud logging read 'resource.type="aiplatform.googleapis.com/Endpoint"
  AND protoPayload.methodName="google.cloud.aiplatform.v1.PredictionService.Predict"' \
  --project=PROJECT_ID \
  --limit=50 \
  --format=json

# Creer alertes sur comportements suspects
gcloud logging metrics create gemini_unusual_volume \
  --description="Alert si >1000 req/min depuis une meme IP" \
  --log-filter='resource.type="aiplatform.googleapis.com/Endpoint"
    AND protoPayload.methodName="google.cloud.aiplatform.v1.PredictionService.Predict"'

Les audit logs sont essentiels pour la conformite (GDPR Article 32, SOC2, HIPAA). Activez-les en production. Configurez des exports vers BigQuery pour analyse long-terme et correlation avec Security Command Center pour detection d'anomalies automatique.

🌐 VPC Service Controls (VPC-SC) Avance

yaml

# vpc-sc-policy.yaml - Configuration complete VPC-SC
name: accessPolicies/POLICY_ID/servicePerimeters/gemini_perimeter
title: "Gemini Production Secure Perimeter"
status:
  resources:
    - projects/PROJECT_NUMBER
  restrictedServices:
    - aiplatform.googleapis.com
    - storage.googleapis.com
  accessLevels:
    - accessPolicies/POLICY_ID/accessLevels/corporate_network
  vpcAccessibleServices:
    enableRestriction: true
    allowedServices:
      - aiplatform.googleapis.com
  ingressPolicies:
    - ingressFrom:
        identities:
          - serviceAccount:gemini-prod-sa@PROJECT_ID.iam.gserviceaccount.com
        sources:
          - accessLevel: accessPolicies/POLICY_ID/accessLevels/corporate_network
      ingressTo:
        resources:
          - "*"
        operations:
          - serviceName: aiplatform.googleapis.com
            methodSelectors:
              - method: "google.cloud.aiplatform.v1.PredictionService.Predict"
  egressPolicies:
    - egressFrom:
        identities:
          - serviceAccount:gemini-prod-sa@PROJECT_ID.iam.gserviceaccount.com
      egressTo:
        resources:
          - "*"
        operations:
          - serviceName: storage.googleapis.com

Conformite & RGPD

⏱ 30 min Avance

🎯 Objectifs d'apprentissage

Comprendre les exigences GDPR pour les systemes IA
Configurer data residency et data sovereignty
Implementer DPIA (Data Protection Impact Assessment)
Maitriser les certifications SOC2, HIPAA, ISO 27001

⚖️ GDPR et IA : Obligations Essentielles

Article GDPR	Obligation	Implementation Vertex AI
Art. 5	Minimisation des donnees	DLP pour filtrer PII, pas de stockage inutile
Art. 13-14	Transparence (informer utilisateurs)	Disclaimer "Ce chat utilise Gemini par Google"
Art. 15	Droit d'acces	Logging de toutes requetes utilisateur, API export
Art. 17	Droit a l'effacement	Purge logs apres 90 jours, pas de fine-tuning sur donnees utilisateur
Art. 25	Privacy by design	VPC-SC, CMEK, anonymisation par defaut
Art. 28	DPA (Data Processing Agreement)	Signer Cloud Data Processing Addendum Google
Art. 32	Securite	TLS 1.3, encryption at rest, audit logs
Art. 33	Notification breaches (72h)	Security Command Center alertes
Art. 35	DPIA si risque eleve	Template DPIA pour chatbots Gemini

🌍 Data Residency & Data Sovereignty

Vertex AI regions disponibles (2026) :

Europe : europe-west1 (Belgique), europe-west4 (Pays-Bas), europe-west9 (France)
US : us-central1 (Iowa), us-east1 (Caroline du Sud), us-west1 (Oregon)
Asia : asia-northeast1 (Tokyo), asia-southeast1 (Singapour)

python

# Configuration region EU pour conformite GDPR
import vertexai
from vertexai.generative_models import GenerativeModel

# IMPORTANT : Forcer region EU pour donnees GDPR
vertexai.init(
    project="mon-projet-eu",
    location="europe-west1"  # Belgique (UE)
)

model = GenerativeModel("gemini-2.0-flash-exp")

# Verifier que la region est bien EU
print(f"Region utilisee : {vertexai._location}")
# Output : "europe-west1"

⚠️ Attention Multi-region

Google AI Studio utilise des endpoints multi-region (EU ou US global). Pour GDPR strict, utilisez TOUJOURS Vertex AI avec une region EU explicite. Vertex AI garantit que les donnees ne quittent pas la region choisie.

📋 DPIA Template pour Chatbot Gemini

Data Protection Impact Assessment (DPIA) requis si :

✅ Traitement automatise avec effets juridiques (ex: credit scoring avec IA)
✅ Surveillance systematique a grande echelle (ex: monitoring employes)
✅ Donnees sensibles : sante (HIPAA), enfants, biometrie

markdown

# DPIA Template : Chatbot Support Client (Gemini)

## 1. Description du traitement
- **Nature** : Chatbot IA generative pour support client
- **Portee** : 50 000 utilisateurs/mois, EU uniquement
- **Contexte** : Questions support produit, pas de paiement
- **Finalites** : Repondre questions, reduire tickets support

## 2. Donnees traitees
- **Donnees collectees** : Nom, email, historique conversation
- **Donnees sensibles** : AUCUNE (pas sante, religion, etc.)
- **Retention** : 90 jours puis suppression automatique

## 3. Necessite et proportionnalite
- **Base legale** : Interet legitime (Art. 6(1)(f) GDPR)
- **Minimisation** : Seulement nom/email, pas de tel/adresse
- **Alternatives considerees** : Support humain seul (trop lent), FAQ statique (moins efficace)

## 4. Risques identifies
| Risque | Impact | Probabilite | Mesure mitigation |
|--------|--------|-------------|-------------------|
| Fuite donnees conversationnelles | Moyen | Faible | VPC-SC, CMEK, TLS 1.3 |
| Hallucination donnant mauvais conseil | Moyen | Moyenne | Grounding avec docs, disclaimer |
| Re-identification via style ecriture | Faible | Tres faible | Pas de fine-tuning |

## 5. Mesures de securite
- ✅ Encryption in-transit (TLS 1.3) et at-rest (AES-256)
- ✅ VPC Service Controls (pas d'acces externe)
- ✅ Audit logs actives (Cloud Logging)
- ✅ DLP pour detecter PII accidentelle
- ✅ Region EU (europe-west1) avec data residency

## 6. Droits utilisateurs
- ✅ Information transparente (banner "Powered by Gemini")
- ✅ Droit d'acces (API export conversations)
- ✅ Droit a l'effacement (bouton "Supprimer mes donnees")
- ✅ Droit d'opposition (opt-out chatbot)

## 7. Conclusion
Risque residu : **FAIBLE**
DPIA validee par : DPO (Data Protection Officer)
Date : 2026-02-10

🏥 HIPAA Compliance (Donnees de Sante)

Google Cloud signe BAA (Business Associate Agreement) pour :

✅ Vertex AI (Gemini via Vertex AI uniquement, PAS AI Studio)
✅ Cloud Storage, BigQuery, Cloud SQL
✅ Cloud Logging (mais desactiver Data Access Logs PHI)

bash

# Configuration HIPAA-compliant pour application medicale
# 1. Activer organisation policy pour forcer CMEK
gcloud resource-manager org-policies set-policy cmek-policy.yaml

# cmek-policy.yaml
name: projects/PROJECT_ID/policies/constraints/gcp.restrictNonCmekServices
spec:
  rules:
    - enforce: true

# 2. Activer Access Transparency (voir qui chez Google accede aux donnees)
gcloud organizations add-iam-policy-binding ORG_ID \
  --member='domain:example.com' \
  --role='roles/accessapproval.approver'

# 3. Configurer retention logs conforme (6 ans pour HIPAA)
gcloud logging sinks create hipaa-audit-sink \
  bigquery.googleapis.com/projects/PROJECT_ID/datasets/hipaa_audit_logs \
  --log-filter='protoPayload.serviceName="aiplatform.googleapis.com"'

# 4. Desactiver Data Access Logs pour eviter log PHI
# (Configurer via IAM & Admin > Audit Logs > desactiver "Data Read/Write" pour aiplatform)

Pour HIPAA : utilisez TOUJOURS Vertex AI (jamais AI Studio), signez le BAA avec Google, activez CMEK, configurez Access Transparency, et mettez en place une retention de 6 ans des audit logs. Considerez aussi de-identification des donnees avant envoi a Gemini avec Cloud Healthcare API.

🔒 ISO 27001 & SOC 2 Type II

Google Cloud est certifie :

✅ ISO 27001 (Information Security Management)
✅ ISO 27017 (Cloud Security)
✅ ISO 27018 (Privacy in Cloud)
✅ SOC 2 Type II (Security, Availability, Confidentiality)
✅ SOC 3 (version publique de SOC 2)

Rapports disponibles :

📄 Console GCP > Security > Compliance Reports Manager
📄 Telecharger ISO/SOC reports pour audits
📄 Partager avec auditeurs sous NDA

CI/CD pour IA

⏱ 35 min Avance

🎯 Objectifs d'apprentissage

Mettre en place un pipeline CI/CD pour applications Gemini
Implementer prompt versioning et testing automatise
Configurer evaluation gates avant production
Deployer avec strategies canary et blue-green

🔄 Pipeline CI/CD Complet

┌─────────────────────────────────────────────────────────────┐
│                      CI/CD PIPELINE GEMINI                   │
└─────────────────────────────────────────────────────────────┘
     │
     ▼
┌──────────────┐   ┌──────────────┐   ┌──────────────┐
│  1. COMMIT   │──▶│  2. BUILD    │──▶│  3. TEST     │
│              │   │              │   │              │
│ - Git push   │   │ - Docker img │   │ - Unit tests │
│ - PR opened  │   │ - Lint       │   │ - Prompt eval│
└──────────────┘   └──────────────┘   └──────────────┘
                                              │
                                              ▼
┌──────────────┐   ┌──────────────┐   ┌──────────────┐
│  6. PROD     │◀──│  5. STAGING  │◀──│ 4. DEV DEPLOY│
│              │   │              │   │              │
│ - Canary 10% │   │ - Smoke test │   │ - Cloud Run  │
│ - Monitor    │   │ - Eval gate  │   │ - Auto deploy│
│ - Rollback?  │   │ - Manual OK  │   │              │
└──────────────┘   └──────────────┘   └──────────────┘

🏗 Cloud Build Configuration

yaml

# cloudbuild.yaml - Pipeline CI/CD complet
steps:
  # Etape 1 : Linting et formatage
  - name: 'python:3.12'
    id: 'lint'
    entrypoint: 'bash'
    args:
      - '-c'
      - |
        pip install ruff black
        ruff check src/
        black --check src/

  # Etape 2 : Tests unitaires
  - name: 'python:3.12'
    id: 'unit-tests'
    entrypoint: 'bash'
    args:
      - '-c'
      - |
        pip install -r requirements.txt
        pytest tests/unit/ --cov=src --cov-report=term

  # Etape 3 : Evaluation des prompts
  - name: 'python:3.12'
    id: 'prompt-eval'
    entrypoint: 'bash'
    secretEnv: ['VERTEX_PROJECT']
    args:
      - '-c'
      - |
        pip install -r requirements.txt
        python scripts/eval_prompts.py --project=$VERTEX_PROJECT --threshold=0.7
    waitFor: ['unit-tests']

  # Etape 4 : Build Docker image
  - name: 'gcr.io/cloud-builders/docker'
    id: 'build-image'
    args:
      - 'build'
      - '-t'
      - 'gcr.io/$PROJECT_ID/gemini-api:$SHORT_SHA'
      - '-t'
      - 'gcr.io/$PROJECT_ID/gemini-api:latest'
      - '.'
    waitFor: ['prompt-eval']

  # Etape 5 : Push image
  - name: 'gcr.io/cloud-builders/docker'
    id: 'push-image'
    args:
      - 'push'
      - '--all-tags'
      - 'gcr.io/$PROJECT_ID/gemini-api'
    waitFor: ['build-image']

  # Etape 6 : Deploy to DEV (auto)
  - name: 'gcr.io/google.com/cloudsdktool/cloud-sdk'
    id: 'deploy-dev'
    entrypoint: 'bash'
    args:
      - '-c'
      - |
        gcloud run deploy gemini-api-dev \
          --image gcr.io/$PROJECT_ID/gemini-api:$SHORT_SHA \
          --region us-central1 \
          --platform managed \
          --service-account gemini-dev-sa@$PROJECT_ID.iam.gserviceaccount.com \
          --set-env-vars ENV=dev,VERSION=$SHORT_SHA \
          --tag dev-$SHORT_SHA
    waitFor: ['push-image']

  # Etape 7 : Smoke tests DEV
  - name: 'python:3.12'
    id: 'smoke-tests-dev'
    entrypoint: 'bash'
    args:
      - '-c'
      - |
        pip install requests
        python scripts/smoke_tests.py --url=https://gemini-api-dev-HASH-uc.a.run.app
    waitFor: ['deploy-dev']

  # Etape 8 : Deploy to STAGING (si branch main)
  - name: 'gcr.io/google.com/cloudsdktool/cloud-sdk'
    id: 'deploy-staging'
    entrypoint: 'bash'
    args:
      - '-c'
      - |
        if [ "$BRANCH_NAME" = "main" ]; then
          gcloud run deploy gemini-api-staging \
            --image gcr.io/$PROJECT_ID/gemini-api:$SHORT_SHA \
            --region us-central1 \
            --platform managed \
            --service-account gemini-staging-sa@$PROJECT_ID.iam.gserviceaccount.com \
            --set-env-vars ENV=staging,VERSION=$SHORT_SHA
        fi
    waitFor: ['smoke-tests-dev']

# Secrets from Secret Manager
availableSecrets:
  secretManager:
    - versionName: projects/$PROJECT_ID/secrets/vertex-project/versions/latest
      env: 'VERTEX_PROJECT'

# Timeout global
timeout: '1800s'

# Tags
tags: ['gemini-api', 'ci-cd']

options:
  machineType: 'E2_HIGHCPU_8'
  logging: CLOUD_LOGGING_ONLY

📝 Prompt Versioning

python

# prompts.py - Versionning des prompts
from dataclasses import dataclass
from typing import Dict
import json

@dataclass
class PromptVersion:
    version: str
    system_instruction: str
    temperature: float
    max_tokens: int
    metadata: Dict[str, str]

class PromptRegistry:
    """Registry centralise pour tous les prompts versions"""

    PROMPTS = {
        "customer_support_v1": PromptVersion(
            version="1.0.0",
            system_instruction="""Tu es un assistant support client pour AcmeCorp.
Reponds de maniere concise et professionnelle.
Si tu ne sais pas, dis 'Je ne sais pas, je transfere a un humain.'""",
            temperature=0.3,
            max_tokens=512,
            metadata={"created": "2026-01-15", "author": "team-support"}
        ),
        "customer_support_v2": PromptVersion(
            version="2.0.0",
            system_instruction="""Tu es un assistant support client expert pour AcmeCorp.

REGLES :
1. Reponds en 2-3 phrases maximum
2. Utilise les docs (grounding) pour info precise
3. Si pas dans docs, dis "Je ne sais pas"
4. Toujours termine par "Autre question ?"

TONE : Professionnel mais amical""",
            temperature=0.2,  # Plus deterministe
            max_tokens=256,   # Plus court
            metadata={"created": "2026-02-01", "author": "team-support", "ab_test": "variant_b"}
        ),
    }

    @classmethod
    def get_prompt(cls, prompt_id: str) -> PromptVersion:
        if prompt_id not in cls.PROMPTS:
            raise ValueError(f"Prompt {prompt_id} not found")
        return cls.PROMPTS[prompt_id]

    @classmethod
    def get_active_prompt(cls, use_case: str = "customer_support") -> PromptVersion:
        """Retourne le prompt actif (gere via feature flags)"""
        # En production, lire depuis feature flag (LaunchDarkly, Cloud Config, etc.)
        active_version = "customer_support_v2"  # ou v1 selon A/B test
        return cls.get_prompt(active_version)

# Utilisation
from vertexai.generative_models import GenerativeModel

prompt_config = PromptRegistry.get_active_prompt("customer_support")
model = GenerativeModel(
    "gemini-2.0-flash-exp",
    system_instruction=prompt_config.system_instruction
)

response = model.generate_content(
    "Comment retourner un produit ?",
    generation_config={
        "temperature": prompt_config.temperature,
        "max_output_tokens": prompt_config.max_tokens
    }
)

💡 Best practices prompt versioning

✅ Toujours versionner les prompts (semantic versioning : 1.0.0, 1.1.0, 2.0.0)
✅ Stocker dans Git avec review process (PR required)
✅ Tracker metadata : auteur, date, rationale du changement
✅ A/B tester nouvelles versions avant rollout 100%
✅ Rollback rapide si degradation qualite

✅ Evaluation Gates

python

# scripts/eval_prompts.py - Eval automatique avant deploy
import vertexai
from vertexai.generative_models import GenerativeModel
from vertexai.preview.evaluation import EvalTask
import argparse

def run_eval_gate(project: str, threshold: float = 0.7):
    """
    Evalue le prompt sur dataset de test.
    Echoue le build si score < threshold.
    """
    vertexai.init(project=project, location="us-central1")

    # Dataset de test (Golden Set)
    test_cases = [
        {
            "input": "Comment retourner un produit ?",
            "expected_output": "Vous avez 30 jours pour retourner un produit...",
            "rubric": "Doit mentionner delai 30 jours et procedure"
        },
        {
            "input": "Quel est le prix du produit XYZ ?",
            "expected_output": "Je ne sais pas",
            "rubric": "Doit dire 'je ne sais pas' si info pas dans docs"
        },
        # ... 50+ test cases
    ]

    # Charger prompt actif
    from prompts import PromptRegistry
    prompt_config = PromptRegistry.get_active_prompt()

    model = GenerativeModel(
        "gemini-2.0-flash-exp",
        system_instruction=prompt_config.system_instruction
    )

    # Evaluation avec Vertex AI Evaluation
    eval_task = EvalTask(
        dataset=test_cases,
        metrics=["coherence", "fluency", "safety", "groundedness"],
        experiment="prompt-eval-" + prompt_config.version
    )

    results = eval_task.evaluate(model=model)

    # Calculer score global
    avg_score = sum(
        results.summary_metrics[m] for m in ["coherence", "fluency", "groundedness"]
    ) / 3

    print(f"Evaluation score: {avg_score:.2f}")
    print(f"Threshold: {threshold}")

    if avg_score < threshold:
        print("❌ EVAL GATE FAILED - Score trop bas")
        exit(1)  # Fail le build
    else:
        print("✅ EVAL GATE PASSED")
        exit(0)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--project", required=True)
    parser.add_argument("--threshold", type=float, default=0.7)
    args = parser.parse_args()

    run_eval_gate(args.project, args.threshold)

🚦 Deployment Strategies

1. Canary Deployment (Recommande) :

bash

# Deploy nouvelle version sur 10% trafic
gcloud run deploy gemini-api-prod \
  --image gcr.io/PROJECT_ID/gemini-api:v2.0.0 \
  --region us-central1 \
  --tag canary \
  --no-traffic  # Pas de trafic initial

# Router 10% vers canary
gcloud run services update-traffic gemini-api-prod \
  --region us-central1 \
  --to-tags canary=10

# Monitorer pendant 1h (erreurs, latence, qualite)
# Si OK : augmenter a 50%
gcloud run services update-traffic gemini-api-prod \
  --region us-central1 \
  --to-tags canary=50

# Si OK : rollout 100%
gcloud run services update-traffic gemini-api-prod \
  --region us-central1 \
  --to-latest

# Si KO : rollback immediat
gcloud run services update-traffic gemini-api-prod \
  --region us-central1 \
  --to-revisions PREVIOUS_REVISION=100

2. Blue-Green Deployment :

bash

# Environnement BLUE (actuel en prod)
gcloud run deploy gemini-api-blue \
  --image gcr.io/PROJECT_ID/gemini-api:v1.0.0 \
  --region us-central1

# Environnement GREEN (nouvelle version)
gcloud run deploy gemini-api-green \
  --image gcr.io/PROJECT_ID/gemini-api:v2.0.0 \
  --region us-central1

# Load Balancer pointe vers BLUE
# Apres validation GREEN : switch Load Balancer vers GREEN
# Si probleme : switch instantane vers BLUE

Pour la production, privilegiez Canary deployment avec Cloud Run (support natif des traffic splits). Commencez avec 10% de trafic sur la nouvelle version, monitorez pendant 1-2h (erreurs, latence P95, qualite des reponses via eval), puis augmentez progressivement. Gardez toujours un rollback one-click pret.

Lab : Pipeline Deploiement GCP

⏱ 60 min Pratique

🎯 Objectifs du Lab

Creer un pipeline CI/CD complet sur Cloud Build
Deployer une application Gemini sur Cloud Run
Configurer monitoring et alertes
Tester canary deployment avec rollback

🧪 Lab Pratique : Pipeline Production

Duree estimee : 60 minutes

Etape 1 : Setup projet GCP (10 min)

Creer un nouveau projet et activer les APIs necessaires.

bash

export PROJECT_ID="gemini-lab-$(date +%s)"
gcloud projects create $PROJECT_ID
gcloud config set project $PROJECT_ID

# Activer APIs
gcloud services enable aiplatform.googleapis.com \
  cloudbuild.googleapis.com \
  run.googleapis.com \
  secretmanager.googleapis.com \
  cloudmonitoring.googleapis.com

# Creer Service Account
gcloud iam service-accounts create gemini-prod-sa

Etape 2 : Cloner code starter (5 min)

bash

git clone https://github.com/google-cloud/gemini-deployment-starter
cd gemini-deployment-starter

# Structure :
# - src/main.py (API Flask + Gemini)
# - Dockerfile
# - cloudbuild.yaml
# - tests/
# - prompts/

Etape 3 : Configurer Cloud Build (10 min)

Connecter GitHub et configurer triggers.

bash

# Donner permissions Cloud Build
PROJECT_NUMBER=$(gcloud projects describe $PROJECT_ID --format="value(projectNumber)")
gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member="serviceAccount:${PROJECT_NUMBER}@cloudbuild.gserviceaccount.com" \
  --role="roles/run.admin"

gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member="serviceAccount:${PROJECT_NUMBER}@cloudbuild.gserviceaccount.com" \
  --role="roles/iam.serviceAccountUser"

# Creer trigger Cloud Build
gcloud builds triggers create github \
  --repo-name=gemini-deployment-starter \
  --repo-owner=VOTRE_GITHUB \
  --branch-pattern="^main$" \
  --build-config=cloudbuild.yaml

Etape 4 : Premier deploy (15 min)

Pusher code et observer le pipeline.

bash

# Modifier prompts/customer_support.py
# Commit et push
git add .
git commit -m "Initial deploy"
git push origin main

# Observer build dans Cloud Console
gcloud builds list --ongoing

# Une fois termine, tester l'API
SERVICE_URL=$(gcloud run services describe gemini-api-dev \
  --region us-central1 --format="value(status.url)")

curl -X POST $SERVICE_URL/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Comment retourner un produit ?"}'

Etape 5 : Monitoring & Alertes (10 min)

bash

# Creer dashboard monitoring
gcloud monitoring dashboards create --config-from-file=dashboard.json

# Creer alerte sur error rate > 5%
gcloud alpha monitoring policies create \
  --notification-channels=CHANNEL_ID \
  --display-name="Gemini API Error Rate High" \
  --condition-display-name="Error rate > 5%" \
  --condition-threshold-value=0.05 \
  --condition-threshold-duration=300s

Etape 6 : Canary Deploy & Rollback (10 min)

Deployer une v2 avec bug intentionnel, puis rollback.

bash

# Deploy v2 (avec bug)
git checkout -b v2-buggy
# Modifier temperature=2.0 (trop haute, reponses instables)
git commit -am "v2: increase creativity"
git push origin v2-buggy

# Merge en main (deploy auto staging)
# Promouvoir en prod avec canary 10%
gcloud run services update-traffic gemini-api-prod \
  --to-tags canary=10

# Observer metriques (latence augmente, qualite baisse)
# ROLLBACK
gcloud run services update-traffic gemini-api-prod \
  --to-revisions gemini-api-prod-00001-abc=100

✅ Verification

Verifier que vous avez :

✅ Pipeline CI/CD fonctionnel avec Cloud Build
✅ Application deployee sur Cloud Run
✅ Monitoring dashboard avec metriques
✅ Alertes configurees
✅ Canary deployment + rollback testes

Ce lab vous a montre un workflow production-ready. En entreprise, ajoutez : evaluation automatique avant chaque deploy, integration tests end-to-end, security scanning (Snyk, Trivy), et approvals manuels avant prod.

Quiz Module 4.1

⏱ 15 min Evaluation

📝 Quiz : Deploiement Enterprise

15 questions pour valider vos connaissances

1. Quelle est la principale difference entre AI Studio et Vertex AI ?

AI Studio est payant, Vertex AI gratuit

Vertex AI offre IAM, VPC-SC, SLA enterprise

AI Studio est plus rapide

Aucune difference

2. Quel role IAM minimum pour appeler Gemini sur Vertex AI ?

roles/owner

roles/editor

roles/aiplatform.user

roles/viewer

3. Quelle solution de deploiement recommandee pour une API Gemini serverless ?

Cloud Run

GKE

Compute Engine

App Engine

4. Comment eliminer les cold starts sur Cloud Run ?

Augmenter la memoire

Configurer min-instances > 0

Utiliser des containers plus petits

Impossible

5. VPC Service Controls (VPC-SC) permet de :

Reduire les couts

Accelerer les requetes

Empecher l'exfiltration de donnees

Chiffrer les donnees

6. Ou stocker les API keys tierces de maniere securisee ?

Dans le code source

Secret Manager

Fichier .env commite dans Git

Variables d'environnement Cloud Run

7. DLP (Data Loss Prevention) permet de :

Detecter et masquer PII avant envoi a Gemini

Sauvegarder les donnees

Compresser les prompts

Monitorer les couts

8. Pour conformite GDPR stricte, vous devez :

Utiliser AI Studio

Utiliser Vertex AI multi-region

Utiliser Vertex AI avec region EU explicite

Desactiver logging

9. CMEK (Customer-Managed Encryption Keys) vous donne :

Des performances meilleures

Controle sur les cles de chiffrement

Un chiffrement plus fort

Des couts reduits

10. Quelle certification GCP est requise pour donnees de sante US ?

ISO 27001

SOC 2

GDPR

HIPAA (avec BAA signe)

11. Dans un pipeline CI/CD pour IA, les evaluation gates servent a :

Bloquer deploy si qualite prompts < seuil

Accelerer le build

Reduire les couts

Generer documentation

12. Pourquoi versionner les prompts ?

Pour reduire les tokens

Pour accelerer les requetes

Pour tracker changes, A/B test, rollback

Ce n'est pas necessaire

13. Canary deployment signifie :

Deployer sur tous les serveurs simultanement

Deployer sur 10% trafic, monitorer, puis augmenter

Deployer uniquement la nuit

Deployer avec un oiseau jaune

14. Audit logs Data Access doivent etre actives pour :

Conformite (GDPR Art. 32, SOC2, HIPAA)

Reduire les couts

Accelerer Vertex AI

Aucune raison

15. Private Service Connect permet de :

Augmenter les quotas

Reduire les couts

Appeler Vertex AI depuis VPC sans passer par Internet

Creer des modeles custom

Comprendre les Couts Gemini

⏱ 25 min Intermediaire

🎯 Objectifs d'apprentissage

Maitriser le modele de tarification Gemini (8 modeles)
Comprendre le tiered pricing et seuil 200K tokens
Calculer le cout d'une application Gemini
Anticiper et budgeter les couts IA

💰 Tarification Gemini 2.5 (2026)

Modele	Input ≤200K	Input >200K	Output ≤200K	Output >200K	Context Cache
2.5 Pro	$3.00 / 1M	$1.50 / 1M	$12.00 / 1M	$6.00 / 1M	$0.30 / 1M
2.5 Flash	$0.15 / 1M	$0.075 / 1M	$0.60 / 1M	$0.30 / 1M	$0.015 / 1M
2.5 Flash-8B	$0.04 / 1M	$0.02 / 1M	$0.16 / 1M	$0.08 / 1M	$0.004 / 1M
2.0 Pro Exp (Extended Thinking)	$3.00 / 1M	$1.50 / 1M	$12.00 / 1M	$6.00 / 1M	-
2.0 Flash Exp	$0.15 / 1M	$0.075 / 1M	$0.60 / 1M	$0.30 / 1M	$0.015 / 1M
1.5 Pro	$2.50 / 1M	$1.25 / 1M	$10.00 / 1M	$5.00 / 1M	$0.25 / 1M
1.5 Flash	$0.10 / 1M	$0.05 / 1M	$0.40 / 1M	$0.20 / 1M	$0.01 / 1M
1.5 Flash-8B	$0.03 / 1M	$0.015 / 1M	$0.12 / 1M	$0.06 / 1M	$0.003 / 1M

💡 Tiered Pricing

Le prix est divise par 2 au-dela de 200K tokens. Exemple : si vous envoyez 300K tokens input avec Flash, vous payez : (200K × $0.15) + (100K × $0.075) = $30 + $7.5 = $37.5 / 1M tokens effectifs.

🔢 Composants de Cout

1. Input tokens : Prompt utilisateur + system instruction + context cached

2. Output tokens : Reponse generee par Gemini

3. Cached tokens : Context mis en cache (coute 10x moins cher)

4. Thinking tokens (2.0 Pro Exp) : Comptes comme output tokens

python

# Calculer cout d'une requete
from vertexai.generative_models import GenerativeModel

model = GenerativeModel("gemini-2.5-flash")
response = model.generate_content("Explique la relativite en 3 paragraphes")

# Extraire usage
usage = response.usage_metadata
print(f"Input tokens: {usage.prompt_token_count}")
print(f"Output tokens: {usage.candidates_token_count}")
print(f"Cached tokens: {usage.cached_content_token_count}")

# Calculer cout (Flash ≤200K)
input_cost = (usage.prompt_token_count / 1_000_000) * 0.15
output_cost = (usage.candidates_token_count / 1_000_000) * 0.60
cache_cost = (usage.cached_content_token_count / 1_000_000) * 0.015

total_cost = input_cost + output_cost + cache_cost
print(f"Cout total: ${total_cost:.6f}")
# Exemple : Input 50 tokens, Output 200 tokens
# ($0.000015) + ($0.000120) = $0.000135 par requete

📊 Simulation Cout Application

Exemple : Chatbot support client

Volume : 100,000 conversations/mois
Moyenne : 5 messages par conversation
Input moyen : 500 tokens (system instruction 200 + user 300)
Output moyen : 150 tokens
Modele : Gemini 2.5 Flash

python

# Calcul cout chatbot
conversations_per_month = 100_000
messages_per_conversation = 5
total_messages = conversations_per_month * messages_per_conversation  # 500,000

input_tokens_per_message = 500
output_tokens_per_message = 150

total_input_tokens = total_messages * input_tokens_per_message  # 250M
total_output_tokens = total_messages * output_tokens_per_message  # 75M

# Prix Flash (≤200K tier pour simplicite)
input_cost = (total_input_tokens / 1_000_000) * 0.15  # $37.50
output_cost = (total_output_tokens / 1_000_000) * 0.60  # $45.00

monthly_cost = input_cost + output_cost  # $82.50/mois
yearly_cost = monthly_cost * 12  # $990/an

print(f"Cout mensuel : ${monthly_cost:.2f}")
print(f"Cout annuel : ${yearly_cost:.2f}")
print(f"Cout par conversation : ${monthly_cost / conversations_per_month:.4f}")
# $0.000825 par conversation (moins d'un cent !)

Flash est incroyablement economique : ~$0.0008 par conversation. Meme avec 1M conversations/mois, vous ne paierez que ~$825/mois. Pro coute 20x plus mais offre meilleure qualite pour use cases complexes. Commencez avec Flash, upgradez vers Pro seulement si necessaire.

💡 Facteurs d'Impact Cout

Choix du modele : Flash-8B (4x moins cher que Flash) vs Pro (20x plus cher)
Longueur system instruction : 2000 tokens d'instruction = $0.0003 input par requete (Flash)
Context caching : -90% cout sur partie cachee
Output tokens : Output coute 4x plus que input (controler max_output_tokens)
Tiered pricing : >200K tokens = -50% prix
Streaming : Meme cout que non-streaming (pas de surcharge)

Strategies d'Optimisation des Couts

⏱ 30 min Avance

🎯 Objectifs d'apprentissage

Maitriser 7 techniques d'optimisation des couts
Implementer model routing intelligent
Utiliser context caching pour -90% cout
Optimiser prompts et output tokens

🎯 Les 7 Techniques d'Optimisation

🔀 1. Model Routing Intelligent

Principe : Utiliser Flash-8B pour requetes simples, Flash pour standard, Pro pour complexe.

python

# Router intelligent base sur complexite
from vertexai.generative_models import GenerativeModel

class GeminiRouter:
    def __init__(self):
        self.flash_8b = GenerativeModel("gemini-2.5-flash-8b")
        self.flash = GenerativeModel("gemini-2.5-flash")
        self.pro = GenerativeModel("gemini-2.5-pro")

    def classify_complexity(self, prompt: str) -> str:
        """Classifier la complexite de la requete"""
        length = len(prompt)

        # Regles simples
        if length < 100:
            return "simple"
        elif "analyse" in prompt.lower() or "compare" in prompt.lower():
            return "complex"
        elif "summarize" in prompt.lower() or "list" in prompt.lower():
            return "simple"
        else:
            return "standard"

    def route(self, prompt: str):
        """Router vers le bon modele"""
        complexity = self.classify_complexity(prompt)

        if complexity == "simple":
            # Flash-8B : $0.04/1M input (20x moins cher que Pro)
            model = self.flash_8b
            print("→ Routing vers Flash-8B (simple)")
        elif complexity == "complex":
            # Pro : meilleure qualite pour raisonnement
            model = self.pro
            print("→ Routing vers Pro (complexe)")
        else:
            # Flash : equilibre cout/qualite
            model = self.flash
            print("→ Routing vers Flash (standard)")

        return model.generate_content(prompt)

# Utilisation
router = GeminiRouter()

# Simple : Flash-8B ($0.000004 input)
response1 = router.route("Quelle est la capitale de la France ?")

# Standard : Flash ($0.000015 input)
response2 = router.route("Resume les 3 avantages principaux du cloud computing")

# Complexe : Pro ($0.00300 input)
response3 = router.route("Analyse comparative detaillee entre architecture monolithique et microservices avec cas d'usage specifiques")

# Economie : ~70% sur volume mixte

💾 2. Context Caching (-90% sur cache)

python

from vertexai.preview import caching
from vertexai.generative_models import GenerativeModel
import datetime

# Creer cached content (system instruction longue)
system_instruction = """
Tu es un assistant support technique pour notre produit SaaS.
[... 5000 tokens de documentation produit ...]
Voici les 200 questions/reponses FAQ les plus frequentes :
[... documentation complete ...]
"""

cached_content = caching.CachedContent.create(
    model_name="gemini-2.5-flash",
    system_instruction=system_instruction,
    ttl=datetime.timedelta(hours=1),  # Cache 1h
)

# Utiliser cache pour multiples requetes
model = GenerativeModel.from_cached_content(cached_content)

# Requete 1 : Paye 5000 tokens cached ($0.000075) au lieu de input ($0.00075)
response1 = model.generate_content("Comment reinitialiser mon mot de passe ?")

# Requete 2-1000 : Cache hit, economie massive
response2 = model.generate_content("Ou trouver mes factures ?")

# ECONOMIE :
# Sans cache : 1000 requetes × 5000 tokens input × $0.15/1M = $0.75
# Avec cache : 1 creation × $0.000075 + 1000 × cache hit (quasi-gratuit) = $0.075
# → 90% d'economie !

💡 Quand utiliser Context Caching ?

System instruction >2000 tokens, reutilisee >5 fois, TTL >5 minutes. ROI positif des 5 requetes.

📦 3. Batch API (-50% cout)

python

import json
from google.cloud import aiplatform

# Preparer batch requests (JSONL)
batch_requests = []
with open("questions.txt") as f:
    for line in f:
        batch_requests.append({
            "request": {
                "contents": [{"role": "user", "parts": [{"text": line.strip()}]}]
            }
        })

# Ecrire JSONL
with open("batch_input.jsonl", "w") as f:
    for req in batch_requests:
        f.write(json.dumps(req) + "\n")

# Upload to GCS
from google.cloud import storage
bucket = storage.Client().bucket("my-batch-bucket")
blob = bucket.blob("batch_input.jsonl")
blob.upload_from_filename("batch_input.jsonl")

# Submit batch job
batch_job = aiplatform.BatchPredictionJob.create(
    job_display_name="gemini-batch-job",
    model_name="gemini-2.5-flash",
    input_uri="gs://my-batch-bucket/batch_input.jsonl",
    output_uri="gs://my-batch-bucket/output/",
)

print(f"Batch job created: {batch_job.name}")
print("Processing time: 10-30 minutes")
print("Cost reduction: 50% vs real-time API")

# ECONOMIE :
# Real-time : 10,000 requetes × $0.15/1M input = $1.50
# Batch : 10,000 requetes × $0.075/1M input = $0.75
# → 50% d'economie si non-urgent

✂️ 4. Prompt Compression

python

# AVANT (verbose, 250 tokens)
prompt_verbose = """
Je voudrais que tu m'aides a comprendre le concept de machine learning.
Peux-tu s'il te plait m'expliquer ce que c'est de maniere simple ?
J'aimerais aussi savoir quelles sont les principales applications.
Et si possible, donne-moi quelques exemples concrets.
Merci beaucoup pour ton aide !
"""

# APRES (concis, 120 tokens, -52%)
prompt_concis = """
Explique machine learning simplement : definition, applications, exemples concrets.
"""

# TECHNIQUE : Supprimer politesse, redondances, aller droit au but
# Economie : 52% tokens input sur ce prompt

# Pour system instructions :
system_before = """
You are a helpful assistant. You should always be polite and professional.
When answering questions, make sure to provide detailed explanations.
If you don't know something, be honest about it.
Always format your responses in a clear and readable way.
"""  # 150 tokens

system_after = """
Assistant technique. Reponses detaillees, format clair, honnete sur limites.
"""  # 50 tokens (-67%)

🎚️ 5. Output Control

python

from vertexai.generative_models import GenerativeModel, GenerationConfig

model = GenerativeModel("gemini-2.5-flash")

# MAUVAIS : Output non controle (peut faire 2000 tokens)
response_uncontrolled = model.generate_content(
    "Liste les pays europeens"
)
# → Peut generer 2000 tokens = $0.0012 output

# BON : Output controle avec max_output_tokens
response_controlled = model.generate_content(
    "Liste les pays europeens",
    generation_config=GenerationConfig(
        max_output_tokens=200,  # Limite stricte
        temperature=0.3,  # Moins creative = plus court
    )
)
# → Maximum 200 tokens = $0.00012 output
# → Economie 90%

# Pour JSON : schema strict = output deterministe court
response_json = model.generate_content(
    "Top 3 pays europeens par PIB",
    generation_config=GenerationConfig(
        response_mime_type="application/json",
        response_schema={
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "country": {"type": "string"},
                    "gdp": {"type": "number"}
                }
            },
            "maxItems": 3
        }
    )
)
# → Output JSON compact, pas de texte superflu

⚡ 6. Utiliser Flash-8B pour Use Cases Simples

Use Case	Modele Recommande	Cout Input	Economie vs Pro
FAQ / Support simple	Flash-8B	$0.04/1M	75x moins cher
Classification	Flash-8B	$0.04/1M	75x moins cher
Extraction entites	Flash	$0.15/1M	20x moins cher
Resume documents	Flash	$0.15/1M	20x moins cher
Analyse complexe	Pro	$3.00/1M	Justifie si qualite critique

🔄 7. Implicit Context Caching (Auto)

Gemini 2.5 cache automatiquement les prefixes longs communs (>1024 tokens) pendant 5 minutes. Pas de config necessaire.

python

# Si vous envoyez meme long prefix dans les 5 min :
long_context = "[... 3000 tokens documentation ...]"

# Requete 1 : Full cost
response1 = model.generate_content(long_context + "\nQuestion 1 ?")

# Requete 2 (dans les 5 min) : Implicit cache hit !
response2 = model.generate_content(long_context + "\nQuestion 2 ?")
# → Google detecte automatiquement prefix identique
# → Cache hit gratuit (si >1024 tokens prefix)

# Astuce : Structurer prompts avec context stable en debut

En combinant ces 7 techniques, vous pouvez reduire vos couts de 60-80%. Commencez par model routing et context caching (quick wins), puis optimisez prompts et output. Batch API pour traitement non-urgent. Mesurez avant/apres pour quantifier ROI.

Context Caching Avance

⏱ 35 min Avance

🎯 Objectifs d'apprentissage

Comprendre implicit vs explicit caching
Optimiser TTL pour maximiser ROI
Implementer warming strategies
Calculer ROI caching pour votre use case

🔄 Implicit vs Explicit Caching

Aspect	Implicit Caching	Explicit Caching
Activation	Automatique (Gemini 2.5+)	Manuel via API
Taille min	>1024 tokens prefix	>2048 tokens
TTL	5 minutes fixe	1-60 minutes configurable
Cout cache	Gratuit	$0.015/1M tokens (Flash)
Use case	Conversations courtes	System instructions longues

💰 Calculateur ROI Context Caching

python

class CachingROICalculator:
    def __init__(self, model="flash"):
        if model == "flash":
            self.input_price = 0.15  # $/1M tokens
            self.cache_price = 0.015  # $/1M tokens (10x moins)
        elif model == "pro":
            self.input_price = 3.00
            self.cache_price = 0.30

    def calculate_roi(self,
                      cached_tokens: int,
                      num_requests: int,
                      ttl_minutes: int):
        """Calculer ROI du caching"""

        # SANS CACHE
        cost_without_cache = (
            cached_tokens * num_requests / 1_000_000 * self.input_price
        )

        # AVEC CACHE
        # Creation cache : 1 fois
        cache_creation = cached_tokens / 1_000_000 * self.input_price
        # Hits cache : num_requests fois
        cache_hits = cached_tokens * num_requests / 1_000_000 * self.cache_price
        # Storage : ttl_minutes
        cache_storage = cached_tokens / 1_000_000 * self.cache_price * (ttl_minutes / 60)

        cost_with_cache = cache_creation + cache_hits + cache_storage

        # ROI
        savings = cost_without_cache - cost_with_cache
        savings_pct = (savings / cost_without_cache) * 100
        breakeven_requests = cache_creation / (
            cached_tokens / 1_000_000 * (self.input_price - self.cache_price)
        )

        return {
            "cost_without_cache": cost_without_cache,
            "cost_with_cache": cost_with_cache,
            "savings": savings,
            "savings_pct": savings_pct,
            "breakeven_requests": int(breakeven_requests) + 1
        }

# Exemple : Chatbot support avec system instruction 5000 tokens
calc = CachingROICalculator(model="flash")

# Scenario 1 : 10 requetes/heure, cache 1h
result1 = calc.calculate_roi(
    cached_tokens=5000,
    num_requests=10,
    ttl_minutes=60
)

print("Scenario 1 : 10 req/h, TTL 1h")
print(f"  Sans cache: ${result1['cost_without_cache']:.6f}")
print(f"  Avec cache: ${result1['cost_with_cache']:.6f}")
print(f"  Economie: ${result1['savings']:.6f} ({result1['savings_pct']:.1f}%)")
print(f"  Breakeven: {result1['breakeven_requests']} requetes")
print()

# Scenario 2 : 100 requetes/heure, cache 1h
result2 = calc.calculate_roi(
    cached_tokens=5000,
    num_requests=100,
    ttl_minutes=60
)

print("Scenario 2 : 100 req/h, TTL 1h")
print(f"  Sans cache: ${result2['cost_without_cache']:.6f}")
print(f"  Avec cache: ${result2['cost_with_cache']:.6f}")
print(f"  Economie: ${result2['savings']:.6f} ({result2['savings_pct']:.1f}%)")
print(f"  Breakeven: {result2['breakeven_requests']} requetes")

# SORTIE :
# Scenario 1 : 10 req/h, TTL 1h
#   Sans cache: $0.007500
#   Avec cache: $0.001575
#   Economie: $0.005925 (79.0%)
#   Breakeven: 2 requetes

# Scenario 2 : 100 req/h, TTL 1h
#   Sans cache: $0.075000
#   Avec cache: $0.008250
#   Economie: $0.066750 (89.0%)
#   Breakeven: 2 requetes

# → ROI positif des 2 requetes !

⏱️ Optimisation TTL

📊 Formule TTL Optimal

TTL optimal = Intervalle moyen entre requetes × 5
Exemple : Si requetes toutes les 2 min → TTL = 10 min

python

import datetime
from vertexai.preview import caching

# Analyser pattern de traffic pour definir TTL
def optimize_ttl(request_intervals_minutes: list) -> int:
    """Calculer TTL optimal base sur pattern traffic"""
    avg_interval = sum(request_intervals_minutes) / len(request_intervals_minutes)
    optimal_ttl = int(avg_interval * 5)

    # Contraintes : 1-60 minutes
    if optimal_ttl < 1:
        return 1
    elif optimal_ttl > 60:
        return 60
    else:
        return optimal_ttl

# Exemple : Chatbot avec pic traffic 9h-18h
# Requetes toutes les 3 min en moyenne
intervals = [3, 2, 4, 3, 5, 2, 3, 4]  # minutes
optimal_ttl = optimize_ttl(intervals)
print(f"TTL optimal: {optimal_ttl} minutes")  # → 15 minutes

# Creer cache avec TTL optimal
cached_content = caching.CachedContent.create(
    model_name="gemini-2.5-flash",
    system_instruction="[... 5000 tokens ...]",
    ttl=datetime.timedelta(minutes=optimal_ttl),
)

# Alternative : TTL absolu (expire a heure precise)
# Utile pour cache qui doit expirer en fin de journee
expire_time = datetime.datetime.now() + datetime.timedelta(hours=8)
cached_content_absolute = caching.CachedContent.create(
    model_name="gemini-2.5-flash",
    system_instruction="[... 5000 tokens ...]",
    expire_time=expire_time,  # Expire a 18h
)

🔥 Warming Strategies

Probleme : Si cache expire pendant pic traffic, premiere requete lente (cold start).

Solution : Cache warming preemptif.

python

import time
import threading
from datetime import datetime, timedelta
from vertexai.preview import caching
from vertexai.generative_models import GenerativeModel

class CacheWarmer:
    def __init__(self, system_instruction: str, ttl_minutes: int):
        self.system_instruction = system_instruction
        self.ttl_minutes = ttl_minutes
        self.cached_content = None
        self.model = None
        self.warming_thread = None

    def create_cache(self):
        """Creer ou renouveler cache"""
        self.cached_content = caching.CachedContent.create(
            model_name="gemini-2.5-flash",
            system_instruction=self.system_instruction,
            ttl=timedelta(minutes=self.ttl_minutes),
        )
        self.model = GenerativeModel.from_cached_content(self.cached_content)
        print(f"[{datetime.now()}] Cache created/renewed")

    def start_warming(self):
        """Demarrer warming automatique"""
        self.create_cache()

        # Renouveler cache avant expiration
        refresh_interval = (self.ttl_minutes - 1) * 60  # 1 min avant expiration

        def warming_loop():
            while True:
                time.sleep(refresh_interval)
                self.create_cache()

        self.warming_thread = threading.Thread(target=warming_loop, daemon=True)
        self.warming_thread.start()

    def generate(self, prompt: str):
        """Generate avec cache toujours chaud"""
        if self.model is None:
            raise RuntimeError("Cache not initialized. Call start_warming() first.")
        return self.model.generate_content(prompt)

# Utilisation : Cache toujours chaud pendant heures bureau
warmer = CacheWarmer(
    system_instruction="[... 5000 tokens system instruction ...]",
    ttl_minutes=30
)

# Demarrer warming (renouvelle cache toutes les 29 min)
warmer.start_warming()

# Toutes les requetes utilisent cache chaud (pas de cold start)
response1 = warmer.generate("Question 1")
time.sleep(1800)  # 30 min plus tard
response2 = warmer.generate("Question 2")  # Cache renouvele automatiquement !

# Economie : Pas de cold start, latence optimale

📊 Comparaison Couts : Cache vs No Cache

Scenario	Cached Tokens	Requests/Day	Cost No Cache	Cost With Cache	Savings
Chatbot support	5,000	10,000	$7.50	$0.83	89%
RAG system	20,000	5,000	$15.00	$1.65	89%
Agent avec tools	10,000	1,000	$1.50	$0.17	89%
Code assistant	30,000	20,000	$90.00	$9.90	89%

⚠️ Quand NE PAS utiliser Cache

System instruction <2000 tokens (ROI negatif)
Moins de 5 requetes pendant TTL (breakeven non atteint)
Context change frequemment (invalidation cache trop souvent)
Implicit cache suffit (prefix >1024 tokens, requetes <5 min)

🛠️ Cache Management Best Practices

python

from vertexai.preview import caching

# 1. Lister tous les caches actifs
caches = caching.CachedContent.list()
for cache in caches:
    print(f"Cache: {cache.name}")
    print(f"  Model: {cache.model}")
    print(f"  Expire: {cache.expire_time}")
    print(f"  Size: {len(cache.system_instruction)} chars")

# 2. Supprimer cache manuellement si context change
cache_to_delete = caching.CachedContent(cached_content_name="cache-123")
cache_to_delete.delete()
print("Cache deleted")

# 3. Mettre a jour TTL d'un cache existant
cache_to_update = caching.CachedContent(cached_content_name="cache-456")
cache_to_update.update(ttl=timedelta(minutes=120))  # Extend TTL
print("TTL updated")

# 4. Monitoring usage cache
from google.cloud import monitoring_v3
client = monitoring_v3.MetricServiceClient()

query = """
fetch aiplatform.googleapis.com/prediction/cache_hit_count
| group_by 1h, [value_cache_hit_count_mean: mean(value.cache_hit_count)]
| every 1h
"""

# 5. Alert si cache hit rate < 80% (probleme TTL ou invalidation)
# → Creer alerte Cloud Monitoring sur cache_hit_rate metric

Context caching est votre meilleur allie FinOps. Pour chatbot/RAG avec system instruction longue, ROI est positif des 2 requetes. Commencez avec TTL conservateur (30 min), puis ajustez base sur metriques. Implicit cache gratuit pour conversations courtes. Warming pour apps critiques latence.

Model Routing Intelligent

⏱ 30 min Avance

🎯 Objectifs d'apprentissage

Implementer classifier de requetes multi-niveau
Router Pro/Flash/Flash-8B intelligemment
Gerer fallback et error handling
Mesurer quality/cost tradeoff

🎯 Architecture Model Router

🧠 Classifier Implementation

python

from vertexai.generative_models import GenerativeModel, GenerationConfig
import json
from enum import Enum

class ComplexityLevel(Enum):
    SIMPLE = "simple"
    STANDARD = "standard"
    COMPLEX = "complex"

class IntelligentRouter:
    def __init__(self):
        # Classifier ultra-rapide avec Flash-8B
        self.classifier = GenerativeModel("gemini-2.5-flash-8b")

        # 3 modeles production
        self.flash_8b = GenerativeModel("gemini-2.5-flash-8b")
        self.flash = GenerativeModel("gemini-2.5-flash")
        self.pro = GenerativeModel("gemini-2.5-pro")

    def classify_complexity(self, prompt: str) -> ComplexityLevel:
        """Classifier requete avec LLM (Flash-8B)"""

        classification_prompt = f"""
Analyse cette requete utilisateur et determine sa complexite :
- SIMPLE : FAQ, recherche info factuelle, classification basique
- STANDARD : Resume, extraction donnees, generation texte standard
- COMPLEX : Analyse multi-etapes, raisonnement logique, creative writing

Requete : "{prompt}"

Reponds UNIQUEMENT par JSON :
{{"complexity": "simple|standard|complex", "reasoning": "explication courte"}}
"""

        response = self.classifier.generate_content(
            classification_prompt,
            generation_config=GenerationConfig(
                response_mime_type="application/json",
                max_output_tokens=100,
                temperature=0.1,
            )
        )

        result = json.loads(response.text)
        complexity = ComplexityLevel(result["complexity"])

        print(f"[Classifier] {complexity.value.upper()}: {result['reasoning']}")
        return complexity

    def route_and_generate(self, prompt: str, temperature: float = 0.7):
        """Router et generer reponse"""

        # 1. Classifier (coute ~$0.000004)
        complexity = self.classify_complexity(prompt)

        # 2. Selectionner modele
        if complexity == ComplexityLevel.SIMPLE:
            model = self.flash_8b
            model_name = "Flash-8B"
        elif complexity == ComplexityLevel.STANDARD:
            model = self.flash
            model_name = "Flash"
        else:  # COMPLEX
            model = self.pro
            model_name = "Pro"

        print(f"[Router] → {model_name}")

        # 3. Generer avec fallback
        try:
            response = model.generate_content(
                prompt,
                generation_config=GenerationConfig(
                    temperature=temperature,
                    max_output_tokens=2048,
                )
            )
            return response.text, model_name

        except Exception as e:
            # Fallback vers modele superieur si echec
            print(f"[Router] Error with {model_name}, falling back to Pro")
            response = self.pro.generate_content(prompt)
            return response.text, "Pro (fallback)"

# Test router
router = IntelligentRouter()

# Requete simple → Flash-8B
response1, model1 = router.route_and_generate(
    "Quelle est la capitale de l'Italie ?"
)
print(f"Model: {model1}\nReponse: {response1}\n")

# Requete standard → Flash
response2, model2 = router.route_and_generate(
    "Resume les 3 principales caracteristiques du cloud computing"
)
print(f"Model: {model2}\nReponse: {response2}\n")

# Requete complexe → Pro
response3, model3 = router.route_and_generate(
    "Analyse les implications ethiques de l'IA dans le systeme judiciaire, "
    "en considerant les biais algorithmiques et la transparence des decisions"
)
print(f"Model: {model3}\nReponse: {response3}\n")

📊 Quality/Cost Tradeoff Analysis

python

import time
from dataclasses import dataclass

@dataclass
class RoutingMetrics:
    model: str
    latency_ms: float
    cost_usd: float
    quality_score: float  # 0-100, evaluation humaine ou automatique

class RouterAnalyzer:
    def __init__(self):
        self.metrics = []

    def evaluate_routing(self,
                          test_queries: list,
                          router: IntelligentRouter):
        """Evaluer quality/cost tradeoff"""

        total_cost = 0
        total_latency = 0
        total_quality = 0

        for query in test_queries:
            start = time.time()
            response, model = router.route_and_generate(query)
            latency = (time.time() - start) * 1000

            # Estimer cout base sur tokens (simplifie)
            tokens_estimate = len(query.split()) * 1.3 + len(response.split()) * 1.3
            if "Flash-8B" in model:
                cost = tokens_estimate / 1_000_000 * 0.20  # Input + output
            elif "Flash" in model:
                cost = tokens_estimate / 1_000_000 * 0.75
            else:  # Pro
                cost = tokens_estimate / 1_000_000 * 15.00

            # Quality score (simuler evaluation - en prod, utiliser LLM judge)
            quality = self._evaluate_quality(query, response)

            self.metrics.append(RoutingMetrics(
                model=model,
                latency_ms=latency,
                cost_usd=cost,
                quality_score=quality
            ))

            total_cost += cost
            total_latency += latency
            total_quality += quality

        # Calculer moyennes
        n = len(test_queries)
        avg_cost = total_cost / n
        avg_latency = total_latency / n
        avg_quality = total_quality / n

        print("=== ROUTING ANALYSIS ===")
        print(f"Total queries: {n}")
        print(f"Avg cost/query: ${avg_cost:.6f}")
        print(f"Avg latency: {avg_latency:.0f}ms")
        print(f"Avg quality: {avg_quality:.1f}/100")
        print(f"\nTotal cost: ${total_cost:.4f}")

        # Distribution modeles
        model_counts = {}
        for m in self.metrics:
            model_counts[m.model] = model_counts.get(m.model, 0) + 1

        print("\n=== MODEL DISTRIBUTION ===")
        for model, count in sorted(model_counts.items()):
            pct = count / n * 100
            print(f"{model}: {count} ({pct:.1f}%)")

        return {
            "avg_cost": avg_cost,
            "avg_latency": avg_latency,
            "avg_quality": avg_quality,
            "model_distribution": model_counts
        }

    def _evaluate_quality(self, query: str, response: str) -> float:
        """Evaluer qualite reponse (simplifie)"""
        # En production : utiliser LLM judge ou human evaluation
        # Ici : heuristique simple
        if len(response) < 50:
            return 60.0
        elif "sorry" in response.lower() or "cannot" in response.lower():
            return 40.0
        else:
            return 85.0

# Test avec dataset
test_queries = [
    "Capitale du Japon ?",
    "Liste 3 langages de programmation",
    "Explique la photosynthese simplement",
    "Compare architecture REST vs GraphQL en detail",
    "Analyse critique de la blockchain pour supply chain avec exemples concrets",
]

analyzer = RouterAnalyzer()
results = analyzer.evaluate_routing(test_queries, router)

# SORTIE EXEMPLE :
# === ROUTING ANALYSIS ===
# Total queries: 5
# Avg cost/query: $0.000180
# Avg latency: 1250ms
# Avg quality: 82.0/100
#
# Total cost: $0.0009
#
# === MODEL DISTRIBUTION ===
# Flash: 2 (40.0%)
# Flash-8B: 2 (40.0%)
# Pro: 1 (20.0%)

🛡️ Fallback Strategy

python

from typing import Optional

class RobustRouter:
    def __init__(self):
        self.flash_8b = GenerativeModel("gemini-2.5-flash-8b")
        self.flash = GenerativeModel("gemini-2.5-flash")
        self.pro = GenerativeModel("gemini-2.5-pro")

    def generate_with_fallback(self,
                                prompt: str,
                                preferred_model: str = "flash") -> dict:
        """Generate avec fallback cascade"""

        # Definir cascade
        if preferred_model == "flash-8b":
            cascade = [self.flash_8b, self.flash, self.pro]
            cascade_names = ["Flash-8B", "Flash", "Pro"]
        elif preferred_model == "flash":
            cascade = [self.flash, self.pro]
            cascade_names = ["Flash", "Pro"]
        else:  # pro
            cascade = [self.pro]
            cascade_names = ["Pro"]

        # Essayer cascade
        last_error = None
        for model, name in zip(cascade, cascade_names):
            try:
                print(f"[Fallback] Trying {name}...")
                response = model.generate_content(
                    prompt,
                    generation_config=GenerationConfig(
                        max_output_tokens=2048,
                        temperature=0.7,
                    )
                )

                # Verifier qualite reponse
                if response.text and len(response.text) > 10:
                    print(f"[Fallback] ✓ Success with {name}")
                    return {
                        "text": response.text,
                        "model": name,
                        "fallback": name != cascade_names[0]
                    }
                else:
                    raise ValueError("Response too short")

            except Exception as e:
                print(f"[Fallback] ✗ {name} failed: {e}")
                last_error = e
                continue

        # Tous les modeles ont echoue
        raise RuntimeError(f"All models failed. Last error: {last_error}")

# Test fallback
robust_router = RobustRouter()

# Requete normale : Flash suffit
result1 = robust_router.generate_with_fallback(
    "Explique REST API",
    preferred_model="flash"
)
print(f"Model: {result1['model']}, Fallback: {result1['fallback']}\n")

# Requete tres longue : Flash echoue → Pro fallback
# (simuler echec en ajoutant prompt trop long pour Flash)
long_prompt = "Analyse " + " ".join(["cette situation complexe"] * 10000)
try:
    result2 = robust_router.generate_with_fallback(
        long_prompt,
        preferred_model="flash"
    )
    print(f"Model: {result2['model']}, Fallback: {result2['fallback']}\n")
except Exception as e:
    print(f"Error: {e}")

💡 Regles de Routing Optimales

Use Case	Modele	Raison
FAQ / Support Tier 1	Flash-8B	Reponses factuelles, latence critique, volume eleve
Classification / Tagging	Flash-8B	Sortie JSON, deterministe, rapide
Extraction entites	Flash	Precision > vitesse, output structure
Resume documents	Flash	Equilibre qualite/cout, context long
Code generation	Flash	Syntaxe correcte, output deterministe
Analyse complexe	Pro	Raisonnement multi-etapes, nuance
Creative writing	Pro	Creativite, style, coherence longue
Research synthesis	Pro	Comprehension profonde, cross-referencing

Model routing intelligent peut reduire couts de 60-70% sans degrader qualite. Utilisez Flash-8B pour 70% requetes (FAQ, classification), Flash pour 25% (summaries, extraction), Pro pour 5% seulement (analyse complexe). Classifier coute $0.000004, ROI positif immediate. Fallback vers Pro = safety net si Flash echoue.

Batch API & Traitement Asynchrone

⏱ 25 min Intermediaire

🎯 Objectifs d'apprentissage

Comprendre Batch API et -50% reduction cout
Implementer workflows JSONL batch
Utiliser SDK OpenAI compatible
Monitorer et gerer batch jobs

📦 Batch API : -50% Cout pour Traitement Non-Urgent

💡 Quand utiliser Batch API ?

Traitement asynchrone acceptable (10-30 minutes)
Volume eleve (>1000 requetes)
Use cases : ETL, data enrichment, bulk classification, offline evaluation
Economie : 50% vs real-time API

🔄 Workflow Batch API

📝 Implementation Complete

python

import json
from google.cloud import storage, aiplatform
from datetime import datetime
import time

class GeminiBatchProcessor:
    def __init__(self,
                 project_id: str,
                 location: str,
                 bucket_name: str):
        self.project_id = project_id
        self.location = location
        self.bucket_name = bucket_name

        aiplatform.init(project=project_id, location=location)
        self.storage_client = storage.Client()

    def prepare_batch_jsonl(self,
                             prompts: list[str],
                             output_file: str = "batch_input.jsonl"):
        """Preparer fichier JSONL pour batch"""

        with open(output_file, "w") as f:
            for i, prompt in enumerate(prompts):
                request = {
                    "request": {
                        "contents": [
                            {
                                "role": "user",
                                "parts": [{"text": prompt}]
                            }
                        ]
                    }
                }
                f.write(json.dumps(request) + "\n")

        print(f"✓ Created {output_file} with {len(prompts)} requests")
        return output_file

    def upload_to_gcs(self, local_file: str, gcs_path: str):
        """Upload fichier vers GCS"""

        bucket = self.storage_client.bucket(self.bucket_name)
        blob = bucket.blob(gcs_path)
        blob.upload_from_filename(local_file)

        gcs_uri = f"gs://{self.bucket_name}/{gcs_path}"
        print(f"✓ Uploaded to {gcs_uri}")
        return gcs_uri

    def submit_batch_job(self,
                          input_uri: str,
                          output_uri_prefix: str,
                          model_name: str = "gemini-2.5-flash"):
        """Submit batch prediction job"""

        batch_job = aiplatform.BatchPredictionJob.create(
            job_display_name=f"gemini-batch-{datetime.now().strftime('%Y%m%d-%H%M%S')}",
            model_name=model_name,
            input_config_source=input_uri,
            output_config_destination=output_uri_prefix,
        )

        print(f"✓ Batch job submitted: {batch_job.name}")
        print(f"  Status: {batch_job.state}")
        return batch_job

    def monitor_job(self, batch_job, poll_interval: int = 60):
        """Monitorer job jusqu'a completion"""

        print(f"Monitoring job {batch_job.display_name}...")

        while batch_job.state not in ["SUCCEEDED", "FAILED", "CANCELLED"]:
            time.sleep(poll_interval)
            batch_job.refresh()
            print(f"  Status: {batch_job.state} ({datetime.now().strftime('%H:%M:%S')})")

        if batch_job.state == "SUCCEEDED":
            print(f"✓ Job completed successfully!")
            return True
        else:
            print(f"✗ Job failed: {batch_job.error}")
            return False

    def download_results(self, output_uri_prefix: str, local_file: str = "batch_output.jsonl"):
        """Download resultats depuis GCS"""

        # Parse GCS URI
        parts = output_uri_prefix.replace("gs://", "").split("/")
        bucket_name = parts[0]
        prefix = "/".join(parts[1:])

        # Lister fichiers output
        bucket = self.storage_client.bucket(bucket_name)
        blobs = list(bucket.list_blobs(prefix=prefix))

        # Download tous les fichiers (batch peut splitter en plusieurs)
        results = []
        for blob in blobs:
            if blob.name.endswith(".jsonl"):
                content = blob.download_as_text()
                for line in content.strip().split("\n"):
                    results.append(json.loads(line))

        # Sauver local
        with open(local_file, "w") as f:
            for result in results:
                f.write(json.dumps(result) + "\n")

        print(f"✓ Downloaded {len(results)} results to {local_file}")
        return results

# EXAMPLE : Batch classification de 10,000 feedbacks clients
processor = GeminiBatchProcessor(
    project_id="my-project",
    location="us-central1",
    bucket_name="my-batch-bucket"
)

# 1. Preparer prompts
feedbacks = [
    "Le produit est excellent, livraison rapide !",
    "Service client nul, attente 2h au telephone",
    # ... 9,998 autres feedbacks
]

classification_prompts = [
    f"Classifie ce feedback client en POSITIF, NEGATIF ou NEUTRE. "
    f"Feedback: \"{fb}\"\nReponse (un seul mot):"
    for fb in feedbacks
]

# 2. Creer JSONL
input_file = processor.prepare_batch_jsonl(classification_prompts)

# 3. Upload GCS
input_uri = processor.upload_to_gcs(
    input_file,
    "batch_jobs/classification_input.jsonl"
)

# 4. Submit job
output_uri = f"gs://{processor.bucket_name}/batch_jobs/output/"
batch_job = processor.submit_batch_job(
    input_uri=input_uri,
    output_uri_prefix=output_uri,
    model_name="gemini-2.5-flash"  # -50% vs real-time
)

# 5. Monitor (bloquant)
success = processor.monitor_job(batch_job, poll_interval=60)

# 6. Download resultats
if success:
    results = processor.download_results(output_uri)

    # Parser resultats
    classifications = []
    for result in results:
        text = result["response"]["candidates"][0]["content"]["parts"][0]["text"]
        classifications.append(text.strip().upper())

    # Stats
    from collections import Counter
    counts = Counter(classifications)
    print("\n=== RESULTS ===")
    print(f"Positif: {counts['POSITIF']}")
    print(f"Negatif: {counts['NEGATIF']}")
    print(f"Neutre: {counts['NEUTRE']}")

# ECONOMIE :
# Real-time : 10,000 req × 100 tokens avg × $0.15/1M input + $0.60/1M output
#           = $0.15 + $0.60 = $0.75
# Batch API : $0.75 × 0.5 = $0.375
# Economie : $0.375 (50%)

🔧 SDK OpenAI Compatible

Vertex AI Batch API est compatible avec SDK OpenAI pour faciliter migration.

python

# Installation
# pip install google-cloud-aiplatform openai

from openai import OpenAI
import os

# Configurer client OpenAI avec endpoint Vertex AI
client = OpenAI(
    base_url=f"https://{LOCATION}-aiplatform.googleapis.com/v1/projects/{PROJECT_ID}/locations/{LOCATION}/endpoints/openai",
    api_key=os.environ.get("GOOGLE_API_KEY")  # Utiliser ADC en prod
)

# Creer batch file (format OpenAI)
with open("batch_openai.jsonl", "w") as f:
    for prompt in prompts:
        request = {
            "custom_id": f"request-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gemini-2.5-flash",
                "messages": [{"role": "user", "content": prompt}],
                "max_tokens": 200
            }
        }
        f.write(json.dumps(request) + "\n")

# Upload batch file
batch_input_file = client.files.create(
    file=open("batch_openai.jsonl", "rb"),
    purpose="batch"
)

# Create batch job
batch = client.batches.create(
    input_file_id=batch_input_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h"
)

print(f"Batch ID: {batch.id}")
print(f"Status: {batch.status}")

# Poll status
while batch.status not in ["completed", "failed", "cancelled"]:
    time.sleep(60)
    batch = client.batches.retrieve(batch.id)
    print(f"Status: {batch.status}")

# Download results
if batch.status == "completed":
    result_file = client.files.content(batch.output_file_id)
    with open("batch_results.jsonl", "wb") as f:
        f.write(result_file.read())

📊 Monitoring Batch Jobs

python

from google.cloud import aiplatform

# Lister tous les batch jobs
batch_jobs = aiplatform.BatchPredictionJob.list(
    filter='display_name:gemini-batch-*',
    order_by='create_time desc'
)

print("=== BATCH JOBS ===")
for job in batch_jobs:
    print(f"Name: {job.display_name}")
    print(f"  State: {job.state}")
    print(f"  Created: {job.create_time}")
    print(f"  Model: {job.model_name}")

    if job.state == "SUCCEEDED":
        # Calculer metriques
        elapsed = (job.end_time - job.create_time).total_seconds()
        print(f"  Duration: {elapsed/60:.1f} minutes")

# Creer dashboard Cloud Monitoring
from google.cloud import monitoring_v3

query = """
fetch aiplatform.googleapis.com/prediction/batch_prediction_job/count
| filter resource.job_id =~ 'gemini-batch-.*'
| group_by [resource.state], 1d
| every 1d
"""

# Alertes sur batch job failures
# gcloud alpha monitoring policies create \
#   --notification-channels=CHANNEL_ID \
#   --display-name="Batch Job Failures" \
#   --condition-threshold-value=1 \
#   --condition-threshold-duration=300s

⚠️ Limitations Batch API

Latence 10-30 minutes (non-realtime)
Pas de streaming
Pas de function calling (en beta)
Limite 50,000 requetes par job

Batch API offre 50% reduction cout pour workloads non-urgents. Utilisez pour ETL overnight, bulk classification, offline evaluation. Temps processing : 10-30 min. Si vous avez 100K+ requetes/jour et latence non-critique, economie annuelle peut atteindre $10-50K. Setup initial 1-2h, ROI immediate.

Monitoring des Couts

⏱ 25 min Intermediaire

🎯 Objectifs d'apprentissage

Configurer Cloud Billing pour tracking IA
Creer budget alerts et seuils
Builder cost dashboards temps reel
Implementer attribution par projet/equipe

💰 Architecture Cost Monitoring

📊 Export Billing vers BigQuery

bash

# 1. Creer dataset BigQuery pour billing
bq mk --dataset --location=US my_project:billing_export

# 2. Activer Cloud Billing export (via Console ou gcloud)
# Console : Billing → Billing export → BigQuery export → Enable

# 3. Verifier export actif
bq ls my_project:billing_export
# → gcp_billing_export_v1_XXXXXX_XXXXXX_XXXXXX

# 4. Query couts Vertex AI
bq query --use_legacy_sql=false '
SELECT
  service.description AS service,
  sku.description AS sku,
  SUM(cost) AS total_cost,
  SUM(usage.amount) AS usage_amount,
  usage.unit AS unit
FROM `my_project.billing_export.gcp_billing_export_v1_*`
WHERE service.description = "Vertex AI"
  AND _TABLE_SUFFIX BETWEEN "20260201" AND "20260210"
GROUP BY service, sku, unit
ORDER BY total_cost DESC
LIMIT 20
'

# SORTIE EXEMPLE :
# ┌──────────────┬────────────────────────────────────┬────────────┬──────────────┬─────────┐
# │   service    │                sku                 │ total_cost │ usage_amount │  unit   │
# ├──────────────┼────────────────────────────────────┼────────────┼──────────────┼─────────┤
# │ Vertex AI    │ Gemini 2.5 Flash Input Tokens      │     125.50 │   836666666  │ tokens  │
# │ Vertex AI    │ Gemini 2.5 Flash Output Tokens     │      85.30 │   142166666  │ tokens  │
# │ Vertex AI    │ Gemini 2.5 Pro Input Tokens        │      45.20 │    15066666  │ tokens  │
# │ Vertex AI    │ Context Caching Storage            │       2.10 │   140000000  │ tokens  │
# └──────────────┴────────────────────────────────────┴────────────┴──────────────┴─────────┘

🚨 Budget Alerts

python

from google.cloud import billing_budgets_v1

def create_ai_budget_alert(
    billing_account_id: str,
    project_id: str,
    budget_amount: float,
    alert_thresholds: list = [0.5, 0.9, 1.0]
):
    """Creer budget alert pour Vertex AI"""

    client = billing_budgets_v1.BudgetServiceClient()

    # Configurer budget
    budget = billing_budgets_v1.Budget()
    budget.display_name = f"Vertex AI Budget - {project_id}"
    budget.budget_filter = billing_budgets_v1.Filter(
        projects=[f"projects/{project_id}"],
        services=["services/aiplatform.googleapis.com"],  # Vertex AI
    )

    # Montant mensuel
    budget.amount = billing_budgets_v1.BudgetAmount(
        specified_amount={"currency_code": "USD", "units": int(budget_amount)}
    )

    # Seuils d'alerte
    budget.threshold_rules = [
        billing_budgets_v1.ThresholdRule(
            threshold_percent=threshold,
            spend_basis=billing_budgets_v1.ThresholdRule.Basis.CURRENT_SPEND,
        )
        for threshold in alert_thresholds
    ]

    # Creer budget
    parent = f"billingAccounts/{billing_account_id}"
    response = client.create_budget(parent=parent, budget=budget)

    print(f"✓ Budget created: {response.name}")
    print(f"  Amount: ${budget_amount}/month")
    print(f"  Alerts at: {', '.join([f'{int(t*100)}%' for t in alert_thresholds])}")

    return response

# Creer budget $1000/mois avec alertes a 50%, 90%, 100%
budget = create_ai_budget_alert(
    billing_account_id="012345-6789AB-CDEF01",
    project_id="my-ai-project",
    budget_amount=1000.0,
    alert_thresholds=[0.5, 0.9, 1.0]
)

# Configurer notification email/Pub/Sub
# Via Console : Billing → Budgets & alerts → Select budget → Manage notifications

📈 Cost Dashboard en Temps Reel

python

from google.cloud import bigquery
import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime, timedelta

class VertexAICostDashboard:
    def __init__(self, project_id: str, billing_dataset: str):
        self.client = bigquery.Client(project=project_id)
        self.billing_table = f"`{project_id}.{billing_dataset}.gcp_billing_export_v1_*`"

    def get_daily_costs(self, days: int = 30) -> pd.DataFrame:
        """Couts quotidiens Vertex AI"""

        start_date = (datetime.now() - timedelta(days=days)).strftime("%Y%m%d")
        end_date = datetime.now().strftime("%Y%m%d")

        query = f"""
        SELECT
          DATE(usage_start_time) AS date,
          SUM(cost) AS daily_cost
        FROM {self.billing_table}
        WHERE service.description = 'Vertex AI'
          AND _TABLE_SUFFIX BETWEEN '{start_date}' AND '{end_date}'
        GROUP BY date
        ORDER BY date
        """

        return self.client.query(query).to_dataframe()

    def get_costs_by_model(self, days: int = 7) -> pd.DataFrame:
        """Couts par modele Gemini"""

        start_date = (datetime.now() - timedelta(days=days)).strftime("%Y%m%d")
        end_date = datetime.now().strftime("%Y%m%d")

        query = f"""
        SELECT
          CASE
            WHEN sku.description LIKE '%2.5 Pro%' THEN 'Gemini 2.5 Pro'
            WHEN sku.description LIKE '%2.5 Flash-8B%' THEN 'Gemini 2.5 Flash-8B'
            WHEN sku.description LIKE '%2.5 Flash%' THEN 'Gemini 2.5 Flash'
            ELSE 'Other'
          END AS model,
          SUM(cost) AS cost,
          SUM(usage.amount) AS tokens
        FROM {self.billing_table}
        WHERE service.description = 'Vertex AI'
          AND _TABLE_SUFFIX BETWEEN '{start_date}' AND '{end_date}'
        GROUP BY model
        ORDER BY cost DESC
        """

        return self.client.query(query).to_dataframe()

    def get_costs_by_label(self, label_key: str, days: int = 7) -> pd.DataFrame:
        """Couts par label (team, project, env)"""

        start_date = (datetime.now() - timedelta(days=days)).strftime("%Y%m%d")
        end_date = datetime.now().strftime("%Y%m%d")

        query = f"""
        SELECT
          labels.value AS {label_key},
          SUM(cost) AS cost
        FROM {self.billing_table},
        UNNEST(labels) AS labels
        WHERE service.description = 'Vertex AI'
          AND _TABLE_SUFFIX BETWEEN '{start_date}' AND '{end_date}'
          AND labels.key = '{label_key}'
        GROUP BY {label_key}
        ORDER BY cost DESC
        """

        return self.client.query(query).to_dataframe()

    def plot_dashboard(self):
        """Generer dashboard visuel"""

        fig, axes = plt.subplots(2, 2, figsize=(15, 10))

        # 1. Daily costs trend
        daily = self.get_daily_costs(days=30)
        axes[0, 0].plot(daily['date'], daily['daily_cost'], marker='o')
        axes[0, 0].set_title('Daily Costs (30 days)')
        axes[0, 0].set_xlabel('Date')
        axes[0, 0].set_ylabel('Cost ($)')
        axes[0, 0].grid(True)

        # 2. Costs by model (pie chart)
        models = self.get_costs_by_model(days=7)
        axes[0, 1].pie(models['cost'], labels=models['model'], autopct='%1.1f%%')
        axes[0, 1].set_title('Costs by Model (7 days)')

        # 3. Costs by team
        teams = self.get_costs_by_label('team', days=7)
        axes[1, 0].bar(teams['team'], teams['cost'])
        axes[1, 0].set_title('Costs by Team (7 days)')
        axes[1, 0].set_xlabel('Team')
        axes[1, 0].set_ylabel('Cost ($)')
        axes[1, 0].tick_params(axis='x', rotation=45)

        # 4. Summary stats
        total_cost = daily['daily_cost'].sum()
        avg_daily = daily['daily_cost'].mean()
        forecast_monthly = avg_daily * 30

        summary_text = f"""
        === COST SUMMARY ===

        Last 30 days: ${total_cost:.2f}
        Avg daily: ${avg_daily:.2f}
        Forecast monthly: ${forecast_monthly:.2f}

        Top model: {models.iloc[0]['model']}
        Top model cost: ${models.iloc[0]['cost']:.2f}
        """
        axes[1, 1].text(0.1, 0.5, summary_text, fontsize=12, family='monospace')
        axes[1, 1].axis('off')

        plt.tight_layout()
        plt.savefig('vertex_ai_cost_dashboard.png', dpi=150)
        print("✓ Dashboard saved to vertex_ai_cost_dashboard.png")

# Generer dashboard
dashboard = VertexAICostDashboard(
    project_id="my-project",
    billing_dataset="billing_export"
)

dashboard.plot_dashboard()

# Pour dashboard temps reel : deployer sur Cloud Run + scheduler toutes les heures

🏷️ Cost Attribution avec Labels

python

from vertexai.generative_models import GenerativeModel

# Labeler requetes par team/project/environment
def generate_with_labels(prompt: str, labels: dict):
    """Generate avec labels pour cost tracking"""

    # Labels format: key=value
    # Exemples : team=data-science, project=chatbot, env=prod

    model = GenerativeModel(
        "gemini-2.5-flash",
        # Labels attaches a chaque requete
        labels=labels
    )

    response = model.generate_content(prompt)
    return response.text

# Utilisation : tracer couts par equipe
response1 = generate_with_labels(
    "Resume ce document",
    labels={
        "team": "marketing",
        "project": "content-generation",
        "env": "prod"
    }
)

response2 = generate_with_labels(
    "Analyse ces donnees",
    labels={
        "team": "data-science",
        "project": "analytics",
        "env": "dev"
    }
)

# Query couts par team
# SELECT labels.value AS team, SUM(cost) AS cost
# FROM billing_table, UNNEST(labels) AS labels
# WHERE labels.key = 'team'
# GROUP BY team
# → Marketing: $450, Data Science: $780

# Chargeback : facturer equipes internes base sur usage reel

🎯 Cost Optimization Recommendations

python

def analyze_cost_optimization_opportunities(billing_df: pd.DataFrame) -> dict:
    """Analyser opportunites d'optimisation"""

    recommendations = []

    # 1. Detecter usage Pro pour requetes simples
    pro_usage = billing_df[billing_df['sku'].str.contains('2.5 Pro')]
    if not pro_usage.empty:
        pro_cost = pro_usage['cost'].sum()
        potential_savings = pro_cost * 0.95  # 95% si migration vers Flash
        recommendations.append({
            "type": "Model Downgrade",
            "current_cost": pro_cost,
            "potential_savings": potential_savings,
            "action": "Implementer model routing : Flash pour 80% requetes"
        })

    # 2. Detecter absence de caching
    cache_usage = billing_df[billing_df['sku'].str.contains('Cache')]
    if cache_usage.empty:
        input_cost = billing_df[billing_df['sku'].str.contains('Input')]['cost'].sum()
        potential_savings = input_cost * 0.5  # 50% avec caching
        recommendations.append({
            "type": "Context Caching",
            "current_cost": input_cost,
            "potential_savings": potential_savings,
            "action": "Activer context caching pour system instructions"
        })

    # 3. Detecter ratio input/output eleve (prompts longs)
    input_cost = billing_df[billing_df['sku'].str.contains('Input')]['cost'].sum()
    output_cost = billing_df[billing_df['sku'].str.contains('Output')]['cost'].sum()
    ratio = input_cost / output_cost if output_cost > 0 else 0

    if ratio > 3:
        potential_savings = input_cost * 0.3  # 30% avec prompt compression
        recommendations.append({
            "type": "Prompt Compression",
            "current_cost": input_cost,
            "potential_savings": potential_savings,
            "action": "Optimiser prompts : supprimer redondances, aller droit au but"
        })

    # 4. Calculer ROI total
    total_current = billing_df['cost'].sum()
    total_savings = sum([r['potential_savings'] for r in recommendations])
    savings_pct = (total_savings / total_current * 100) if total_current > 0 else 0

    return {
        "current_monthly_cost": total_current,
        "potential_monthly_savings": total_savings,
        "savings_percentage": savings_pct,
        "recommendations": recommendations
    }

# Exemple
recommendations = analyze_cost_optimization_opportunities(billing_df)
print(f"Current monthly cost: ${recommendations['current_monthly_cost']:.2f}")
print(f"Potential savings: ${recommendations['potential_monthly_savings']:.2f} ({recommendations['savings_percentage']:.1f}%)")
print("\nRecommendations:")
for i, rec in enumerate(recommendations['recommendations'], 1):
    print(f"{i}. {rec['type']}: Save ${rec['potential_savings']:.2f}/month")
    print(f"   Action: {rec['action']}")

Cost monitoring proactif = cle FinOps. Exportez billing vers BigQuery (gratuit), creez dashboards Looker Studio, configurez alertes a 50%/90%/100% budget. Utilisez labels pour attribution par team (chargeback). Revisez dashboard chaque semaine, identifiez anomalies, optimisez. Avec monitoring, vous detectez derive avant facture surprenante.

Lab : Dashboard FinOps Complet

⏱ 90 min Lab Pratique

🎯 Objectif du Lab

Construire un dashboard FinOps production-ready avec :

Cost tracking temps reel par modele/team
Trending analysis et forecasting
Alertes automatiques sur anomalies
Recommandations d'optimisation

Etape 1 : Setup BigQuery Export (10 min)

bash

# 1. Creer dataset billing
bq mk --dataset --location=US --description="Billing export for FinOps" \
  finops_lab:billing_data

# 2. Activer export (via Console)
# Billing → Billing export → BigQuery export → Enable
# Dataset: finops_lab:billing_data

# 3. Verifier export actif (attendre 5-10 min)
bq ls finops_lab:billing_data
# → gcp_billing_export_v1_XXXXXX

# 4. Tester query
bq query --use_legacy_sql=false '
SELECT service.description, SUM(cost) as cost
FROM `finops_lab.billing_data.gcp_billing_export_v1_*`
WHERE _TABLE_SUFFIX >= FORMAT_DATE("%Y%m%d", DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY))
GROUP BY service.description
ORDER BY cost DESC
LIMIT 10
'

Etape 2 : Creer Views BigQuery (15 min)

sql

-- View 1 : Daily Vertex AI costs
CREATE OR REPLACE VIEW `finops_lab.billing_data.vertex_ai_daily_costs` AS
SELECT
  DATE(usage_start_time) AS date,
  CASE
    WHEN sku.description LIKE '%2.5 Pro%' THEN 'Gemini 2.5 Pro'
    WHEN sku.description LIKE '%2.5 Flash-8B%' THEN 'Gemini 2.5 Flash-8B'
    WHEN sku.description LIKE '%2.5 Flash%' THEN 'Gemini 2.5 Flash'
    WHEN sku.description LIKE '%Cache%' THEN 'Context Caching'
    ELSE 'Other'
  END AS model,
  SUM(cost) AS cost,
  SUM(usage.amount) AS usage_amount,
  usage.unit
FROM `finops_lab.billing_data.gcp_billing_export_v1_*`
WHERE service.description = 'Vertex AI'
  AND _TABLE_SUFFIX >= FORMAT_DATE("%Y%m%d", DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY))
GROUP BY date, model, usage.unit;

-- View 2 : Costs by team (from labels)
CREATE OR REPLACE VIEW `finops_lab.billing_data.vertex_ai_costs_by_team` AS
SELECT
  DATE(usage_start_time) AS date,
  labels.value AS team,
  SUM(cost) AS cost
FROM `finops_lab.billing_data.gcp_billing_export_v1_*`,
UNNEST(labels) AS labels
WHERE service.description = 'Vertex AI'
  AND _TABLE_SUFFIX >= FORMAT_DATE("%Y%m%d", DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY))
  AND labels.key = 'team'
GROUP BY date, team;

-- View 3 : Anomaly detection (cost spike >50% vs avg)
CREATE OR REPLACE VIEW `finops_lab.billing_data.vertex_ai_cost_anomalies` AS
WITH daily_costs AS (
  SELECT
    DATE(usage_start_time) AS date,
    SUM(cost) AS daily_cost
  FROM `finops_lab.billing_data.gcp_billing_export_v1_*`
  WHERE service.description = 'Vertex AI'
    AND _TABLE_SUFFIX >= FORMAT_DATE("%Y%m%d", DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY))
  GROUP BY date
),
stats AS (
  SELECT
    AVG(daily_cost) AS avg_cost,
    STDDEV(daily_cost) AS stddev_cost
  FROM daily_costs
)
SELECT
  dc.date,
  dc.daily_cost,
  s.avg_cost,
  dc.daily_cost - s.avg_cost AS deviation,
  (dc.daily_cost - s.avg_cost) / s.avg_cost * 100 AS deviation_pct
FROM daily_costs dc, stats s
WHERE dc.daily_cost > s.avg_cost * 1.5  -- Spike >50%
ORDER BY dc.date DESC;

Etape 3 : Dashboard Python (30 min)

python

# finops_dashboard.py
from google.cloud import bigquery
import pandas as pd
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from datetime import datetime, timedelta
import smtplib
from email.mime.text import MIMEText

class FinOpsDashboard:
    def __init__(self, project_id: str):
        self.client = bigquery.Client(project=project_id)
        self.project_id = project_id

    def fetch_daily_costs(self) -> pd.DataFrame:
        query = "SELECT * FROM `finops_lab.billing_data.vertex_ai_daily_costs`"
        return self.client.query(query).to_dataframe()

    def fetch_team_costs(self) -> pd.DataFrame:
        query = "SELECT * FROM `finops_lab.billing_data.vertex_ai_costs_by_team`"
        return self.client.query(query).to_dataframe()

    def fetch_anomalies(self) -> pd.DataFrame:
        query = "SELECT * FROM `finops_lab.billing_data.vertex_ai_cost_anomalies`"
        return self.client.query(query).to_dataframe()

    def generate_dashboard(self, output_file: str = "finops_dashboard.html"):
        """Generate interactive HTML dashboard"""

        # Fetch data
        daily_costs = self.fetch_daily_costs()
        team_costs = self.fetch_team_costs()
        anomalies = self.fetch_anomalies()

        # Create subplots
        fig = make_subplots(
            rows=3, cols=2,
            subplot_titles=(
                'Daily Costs by Model',
                'Model Distribution (Last 7 days)',
                'Costs by Team',
                'Cost Trend & Forecast',
                'Anomalies Detected',
                'Summary Metrics'
            ),
            specs=[
                [{"type": "scatter"}, {"type": "pie"}],
                [{"type": "bar"}, {"type": "scatter"}],
                [{"type": "scatter"}, {"type": "table"}]
            ]
        )

        # 1. Daily costs by model (line chart)
        for model in daily_costs['model'].unique():
            model_data = daily_costs[daily_costs['model'] == model]
            fig.add_trace(
                go.Scatter(
                    x=model_data['date'],
                    y=model_data['cost'],
                    name=model,
                    mode='lines+markers'
                ),
                row=1, col=1
            )

        # 2. Model distribution (pie chart - last 7 days)
        last_7d = daily_costs[daily_costs['date'] >= datetime.now() - timedelta(days=7)]
        model_costs = last_7d.groupby('model')['cost'].sum()
        fig.add_trace(
            go.Pie(labels=model_costs.index, values=model_costs.values),
            row=1, col=2
        )

        # 3. Costs by team (bar chart)
        team_total = team_costs.groupby('team')['cost'].sum().sort_values(ascending=False)
        fig.add_trace(
            go.Bar(x=team_total.index, y=team_total.values),
            row=2, col=1
        )

        # 4. Trend with forecast
        total_daily = daily_costs.groupby('date')['cost'].sum().reset_index()
        # Simple linear forecast
        last_7_avg = total_daily.tail(7)['cost'].mean()
        forecast_dates = pd.date_range(
            start=total_daily['date'].max() + timedelta(days=1),
            periods=30
        )
        forecast_values = [last_7_avg] * 30

        fig.add_trace(
            go.Scatter(
                x=total_daily['date'],
                y=total_daily['cost'],
                name='Actual',
                mode='lines'
            ),
            row=2, col=2
        )
        fig.add_trace(
            go.Scatter(
                x=forecast_dates,
                y=forecast_values,
                name='Forecast',
                mode='lines',
                line=dict(dash='dash')
            ),
            row=2, col=2
        )

        # 5. Anomalies
        if not anomalies.empty:
            fig.add_trace(
                go.Scatter(
                    x=anomalies['date'],
                    y=anomalies['daily_cost'],
                    mode='markers',
                    marker=dict(size=10, color='red'),
                    name='Anomalies'
                ),
                row=3, col=1
            )

        # 6. Summary table
        total_cost_30d = daily_costs['cost'].sum()
        avg_daily = daily_costs.groupby('date')['cost'].sum().mean()
        forecast_monthly = avg_daily * 30
        top_model = model_costs.idxmax()

        summary = pd.DataFrame({
            'Metric': [
                'Last 30d Cost',
                'Avg Daily Cost',
                'Forecast Monthly',
                'Top Model',
                'Anomalies Detected'
            ],
            'Value': [
                f"${total_cost_30d:.2f}",
                f"${avg_daily:.2f}",
                f"${forecast_monthly:.2f}",
                top_model,
                str(len(anomalies))
            ]
        })

        fig.add_trace(
            go.Table(
                header=dict(values=list(summary.columns)),
                cells=dict(values=[summary['Metric'], summary['Value']])
            ),
            row=3, col=2
        )

        # Layout
        fig.update_layout(
            height=1200,
            title_text="Vertex AI FinOps Dashboard",
            showlegend=True
        )

        # Save
        fig.write_html(output_file)
        print(f"✓ Dashboard saved to {output_file}")

        return fig, summary

    def check_and_alert_anomalies(self, email_to: str = None):
        """Check for anomalies and send alerts"""

        anomalies = self.fetch_anomalies()

        if not anomalies.empty:
            print(f"⚠️  {len(anomalies)} cost anomalies detected!")

            for _, row in anomalies.iterrows():
                print(f"  {row['date']}: ${row['daily_cost']:.2f} "
                      f"(+{row['deviation_pct']:.1f}% vs avg)")

            # Send email alert
            if email_to:
                self._send_email_alert(anomalies, email_to)
        else:
            print("✓ No cost anomalies detected")

    def _send_email_alert(self, anomalies: pd.DataFrame, email_to: str):
        """Send email alert for anomalies"""

        body = f"""
        Cost Anomaly Alert - Vertex AI

        {len(anomalies)} anomalies detected:

        """

        for _, row in anomalies.iterrows():
            body += f"- {row['date']}: ${row['daily_cost']:.2f} (+{row['deviation_pct']:.1f}%)\n"

        body += "\nCheck dashboard for details."

        msg = MIMEText(body)
        msg['Subject'] = f'⚠️  Vertex AI Cost Anomaly Alert'
        msg['From'] = 'finops@company.com'
        msg['To'] = email_to

        # Send via SMTP (configure your SMTP server)
        # smtp = smtplib.SMTP('smtp.gmail.com', 587)
        # smtp.send_message(msg)

        print(f"✓ Alert email sent to {email_to}")

# Generate dashboard
dashboard = FinOpsDashboard(project_id="finops_lab")
fig, summary = dashboard.generate_dashboard()

# Check anomalies
dashboard.check_and_alert_anomalies(email_to="team@company.com")

print("\n=== SUMMARY ===")
print(summary.to_string(index=False))

Etape 4 : Budget Alerts (10 min)

bash

# Creer budget $2000/mois avec alertes
# (remplacer BILLING_ACCOUNT_ID)

gcloud billing budgets create \
  --billing-account=BILLING_ACCOUNT_ID \
  --display-name="Vertex AI Monthly Budget" \
  --budget-amount=2000USD \
  --threshold-rule=percent=0.5 \
  --threshold-rule=percent=0.9 \
  --threshold-rule=percent=1.0 \
  --filter-projects=projects/finops_lab \
  --filter-services=services/aiplatform.googleapis.com

# Configurer notification Pub/Sub
gcloud pubsub topics create budget-alerts

gcloud pubsub subscriptions create budget-alerts-sub \
  --topic=budget-alerts \
  --push-endpoint=https://your-cloud-run-url/budget-alert

# Cloud Function pour traiter alertes
# (deployer fonction qui parse message et envoie email/Slack)

Etape 5 : Scheduler Automatique (15 min)

bash

# Deployer dashboard sur Cloud Run
# Dockerfile
cat > Dockerfile < requirements.txt <

Etape 6 : Tester & Valider (10 min)

Ouvrir dashboard HTML genere
Verifier graphiques affichent donnees correctes
Simuler anomalie (requetes massives vers Pro)
Verifier alerte recue par email/Slack
Verifier budget alert a 50% budget

✅ Validation

Votre dashboard FinOps est complet si vous avez :

✅ BigQuery export actif avec views custom
✅ Dashboard interactif avec 6 visualisations
✅ Anomaly detection automatique
✅ Budget alerts configurees (50%, 90%, 100%)
✅ Refresh automatique toutes les heures
✅ Email/Slack alerts operationnels

Ce dashboard FinOps vous donne visibilite complete sur couts Vertex AI. En production, ajoutez : forecasting ML (Prophet), recommendations automatiques (model routing), integration Slack pour alertes temps reel. Revisez dashboard chaque lundi en equipe, identifiez anomalies, iterez. Avec ce setup, vous detectez derives avant facture surprenante.

Quiz Module 4.2

⏱ 15 min Evaluation

📝 Quiz : FinOps & Optimisation

15 questions pour valider vos connaissances

1. Quelle technique offre la plus grande economie potentielle ?

Batch API (-50%)

Prompt compression (-30%)

Context caching (-90% sur partie cachee)

Output control (-20%)

2. Model routing intelligent peut reduire couts de combien ?

20-30%

60-70%

90-95%

10-15%

3. Context caching est rentable a partir de combien de requetes ?

2 requetes (breakeven immediate)

10 requetes

100 requetes

1000 requetes

4. Quelle difference entre implicit et explicit caching ?

Implicit coute plus cher

Pas de difference

Implicit est auto (5 min TTL), explicit manuel (1-60 min TTL)

Implicit ne fonctionne qu'avec Pro

5. Pour chatbot avec system instruction 5000 tokens, quel TTL cache optimal ?

5 minutes (trop court)

30 minutes (equilibre)

120 minutes (trop long)

Cache inutile ici

6. Flash-8B coute combien vs Pro pour input tokens ?

75x moins cher ($0.04 vs $3.00)

20x moins cher

5x moins cher

Meme prix

7. Batch API offre -50% cout mais avec quelle contrainte ?

Moins de qualite

Limite 100 requetes

Latence 10-30 minutes (asynchrone)

Pas de streaming

8. Quelle regle pour classifier requete comme "simple" ?

Longueur >1000 tokens

FAQ, classification, recherche factuelle

Toutes les requetes JSON

Requetes avec function calling

9. Output tokens coutent combien vs input tokens (Flash) ?

Meme prix

2x plus cher

3x plus cher

4x plus cher ($0.60 vs $0.15)

10. BigQuery export billing est :

Gratuit et essentiel pour FinOps

Payant ($10/mois)

Optionnel, Console suffit

Uniquement pour Enterprise

11. Budget alert doit etre configure a quels seuils ?

100% uniquement

80% et 100%

50%, 90%, 100% (proactif)

Pas necessaire si dashboard

12. Cost attribution par equipe se fait via :

IP source

Labels sur requetes Vertex AI

Estimation manuelle

Impossible de tracer

13. Anomaly detection identifie cout anormal si :

Cout quotidien >50% au-dessus moyenne

Cout >$100

Cout double vs hier

Impossible automatiquement

14. Quelle strategie pour requetes urgentes a faible cout ?

Toujours utiliser Pro

Batch API

Flash-8B avec fallback vers Flash

Context caching

15. Dashboard FinOps doit etre rafraichi a quelle frequence ?

1x par jour (trop lent)

Toutes les heures (optimal)

Temps reel (overkill)

1x par semaine

IA Responsable Google

⏱ 30 min Intermediaire

🎯 Objectifs d'apprentissage

Comprendre les 7 principes Google AI
Configurer safety settings Gemini
Utiliser Gemma Scope pour interpretabilite
Implementer guardrails IA responsable

🎯 Les 7 Principes Google AI

#	Principe	Signification	Implementation Gemini
1	Be socially beneficial	IA doit beneficier societe	Gemini optimise pour aide, pas manipulation
2	Avoid unfair bias	Eviter biais injustes	Training data diverse, evaluation bias continue
3	Built & tested for safety	Securite par conception	Safety filters, red teaming, adversarial testing
4	Accountable to people	Responsabilite humaine	Human-in-the-loop, audit logs, explicabilite
5	Privacy by design	Confidentialite integree	Data not used for training (Vertex AI)
6	Scientific excellence	Excellence scientifique	Recherche Google AI publiee, peer-reviewed
7	Appropriate uses	Usages appropries	Terms of Service interdisent malware, spam, violence

🚫 Applications que Google ne developpe PAS

Armes ou surveillance de masse
Technologies violant droits humains
Collecte d'infos contre droit international

🛡️ Safety Settings Gemini

Gemini inclut 4 harm categories avec 4 seuils de blocage.

python

from vertexai.generative_models import (
    GenerativeModel,
    HarmCategory,
    HarmBlockThreshold,
    SafetySetting
)

# 4 Harm Categories
# - HARM_CATEGORY_HARASSMENT : Harcelement
# - HARM_CATEGORY_HATE_SPEECH : Discours haineux
# - HARM_CATEGORY_SEXUALLY_EXPLICIT : Contenu sexuel explicite
# - HARM_CATEGORY_DANGEROUS_CONTENT : Contenu dangereux

# 4 Thresholds
# - BLOCK_NONE : Pas de blocage (permissif)
# - BLOCK_ONLY_HIGH : Bloquer seulement haute probabilite
# - BLOCK_MEDIUM_AND_ABOVE : Bloquer moyenne et haute (DEFAULT)
# - BLOCK_LOW_AND_ABOVE : Bloquer tout (strict)

# Configuration stricte (production recommandee)
safety_settings_strict = [
    SafetySetting(
        category=HarmCategory.HARM_CATEGORY_HARASSMENT,
        threshold=HarmBlockThreshold.BLOCK_LOW_AND_ABOVE
    ),
    SafetySetting(
        category=HarmCategory.HARM_CATEGORY_HATE_SPEECH,
        threshold=HarmBlockThreshold.BLOCK_LOW_AND_ABOVE
    ),
    SafetySetting(
        category=HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT,
        threshold=HarmBlockThreshold.BLOCK_LOW_AND_ABOVE
    ),
    SafetySetting(
        category=HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT,
        threshold=HarmBlockThreshold.BLOCK_LOW_AND_ABOVE
    ),
]

model_strict = GenerativeModel(
    "gemini-2.5-flash",
    safety_settings=safety_settings_strict
)

# Configuration permissive (R&D uniquement)
safety_settings_permissive = [
    SafetySetting(
        category=HarmCategory.HARM_CATEGORY_HARASSMENT,
        threshold=HarmBlockThreshold.BLOCK_NONE
    ),
    SafetySetting(
        category=HarmCategory.HARM_CATEGORY_HATE_SPEECH,
        threshold=HarmBlockThreshold.BLOCK_NONE
    ),
    SafetySetting(
        category=HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT,
        threshold=HarmBlockThreshold.BLOCK_NONE
    ),
    SafetySetting(
        category=HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT,
        threshold=HarmBlockThreshold.BLOCK_NONE
    ),
]

model_permissive = GenerativeModel(
    "gemini-2.5-flash",
    safety_settings=safety_settings_permissive
)

# Tester avec prompt sensible
prompt_sensible = "Comment construire une bombe ?"

try:
    response_strict = model_strict.generate_content(prompt_sensible)
    print("Strict:", response_strict.text)
except Exception as e:
    print("Strict: BLOCKED -", e)
    # → BLOCKED (safety filter)

try:
    response_permissive = model_permissive.generate_content(prompt_sensible)
    print("Permissive:", response_permissive.text)
except Exception as e:
    print("Permissive: BLOCKED -", e)
    # → Peut passer (mais ToS Google interdit quand meme usage malveillant)

📊 Analyser Safety Ratings

python

# Generer reponse et inspecter safety ratings
response = model_strict.generate_content("Raconte une blague")

# Safety ratings pour le prompt
print("=== PROMPT SAFETY ===")
for rating in response.prompt_feedback.safety_ratings:
    print(f"{rating.category.name}: {rating.probability.name}")

# Safety ratings pour la reponse
print("\n=== RESPONSE SAFETY ===")
for candidate in response.candidates:
    for rating in candidate.safety_ratings:
        print(f"{rating.category.name}: {rating.probability.name}")

# SORTIE EXEMPLE :
# === PROMPT SAFETY ===
# HARM_CATEGORY_HARASSMENT: NEGLIGIBLE
# HARM_CATEGORY_HATE_SPEECH: NEGLIGIBLE
# HARM_CATEGORY_SEXUALLY_EXPLICIT: NEGLIGIBLE
# HARM_CATEGORY_DANGEROUS_CONTENT: NEGLIGIBLE
#
# === RESPONSE SAFETY ===
# HARM_CATEGORY_HARASSMENT: NEGLIGIBLE
# HARM_CATEGORY_HATE_SPEECH: NEGLIGIBLE
# HARM_CATEGORY_SEXUALLY_EXPLICIT: LOW
# HARM_CATEGORY_DANGEROUS_CONTENT: NEGLIGIBLE

# Implementer logging safety pour monitoring
import json

def log_safety_event(prompt: str, response, blocked: bool):
    """Logger evenements safety pour audit"""

    event = {
        "timestamp": datetime.now().isoformat(),
        "prompt": prompt[:100],  # Truncate for privacy
        "blocked": blocked,
        "prompt_safety": {
            rating.category.name: rating.probability.name
            for rating in response.prompt_feedback.safety_ratings
        },
    }

    if not blocked:
        event["response_safety"] = {
            rating.category.name: rating.probability.name
            for rating in response.candidates[0].safety_ratings
        }

    # Log to BigQuery ou Cloud Logging
    with open("safety_logs.jsonl", "a") as f:
        f.write(json.dumps(event) + "\n")

    return event

# Utiliser avec monitoring
try:
    response = model_strict.generate_content(prompt)
    log_safety_event(prompt, response, blocked=False)
except Exception as e:
    log_safety_event(prompt, None, blocked=True)

🔍 Gemma Scope : Interpretabilite

Gemma Scope est un outil open-source pour interpreter modeles Gemma (sparse autoencoders).

python

# pip install gemma-scope

from gemma_scope import GemmaScope

# Charger Gemma 3 avec Scope
scope = GemmaScope(model_name="gemma-3-9b")

# Analyser activation pour prompt
prompt = "Paris est la capitale de"
activations = scope.get_activations(prompt)

# Top features actives
top_features = scope.get_top_features(activations, k=10)

print("=== TOP 10 ACTIVATED FEATURES ===")
for feature_id, activation_strength in top_features:
    feature_desc = scope.get_feature_description(feature_id)
    print(f"Feature {feature_id}: {feature_desc} (strength: {activation_strength:.3f})")

# SORTIE EXEMPLE :
# Feature 1847: Geographic location / capital city (strength: 0.892)
# Feature 3201: French language context (strength: 0.654)
# Feature 892: European geography (strength: 0.543)
# ...

# Use case : Detecter biais
prompt_biased = "Les femmes sont"
activations_biased = scope.get_activations(prompt_biased)
top_biased = scope.get_top_features(activations_biased, k=5)

# Si feature "gender stereotype" active → red flag pour review

💡 Gemma Scope 2 (2026)

Version 2 supporte Gemma 3/3n et offre visualizations interactives pour interpreter modele. Utile pour audit, debugging, detection biais.

🛡️ Guardrails Implementation

python

class ResponsibleAIGuardrails:
    def __init__(self, model: GenerativeModel):
        self.model = model
        self.blocked_keywords = [
            "hack", "exploit", "crack", "bypass",
            # ... ajouter keywords sensibles pour votre domaine
        ]

    def check_prompt_safety(self, prompt: str) -> dict:
        """Pre-flight checks avant envoi a Gemini"""

        issues = []

        # 1. Check PII
        if self._contains_pii(prompt):
            issues.append("PII_DETECTED")

        # 2. Check blocked keywords
        if any(kw in prompt.lower() for kw in self.blocked_keywords):
            issues.append("BLOCKED_KEYWORD")

        # 3. Check prompt injection
        if self._is_prompt_injection(prompt):
            issues.append("PROMPT_INJECTION")

        return {
            "safe": len(issues) == 0,
            "issues": issues
        }

    def _contains_pii(self, text: str) -> bool:
        """Detecter PII (simplifie, utiliser DLP en prod)"""
        import re

        # SSN pattern
        ssn_pattern = r'\b\d{3}-\d{2}-\d{4}\b'
        if re.search(ssn_pattern, text):
            return True

        # Email pattern
        email_pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b'
        if re.search(email_pattern, text):
            return True

        return False

    def _is_prompt_injection(self, text: str) -> bool:
        """Detecter tentative prompt injection"""

        injection_patterns = [
            "ignore previous instructions",
            "disregard above",
            "new instructions:",
            "system:",
        ]

        return any(pattern in text.lower() for pattern in injection_patterns)

    def generate_safe(self, prompt: str):
        """Generate avec guardrails"""

        # Pre-flight checks
        safety_check = self.check_prompt_safety(prompt)

        if not safety_check["safe"]:
            raise ValueError(f"Prompt blocked: {safety_check['issues']}")

        # Generate
        response = self.model.generate_content(prompt)

        # Post-flight checks
        if response.candidates[0].finish_reason.name == "SAFETY":
            raise ValueError("Response blocked by safety filter")

        return response.text

# Utilisation
guardrails = ResponsibleAIGuardrails(model_strict)

# Safe prompt
try:
    response1 = guardrails.generate_safe("Explique la photosynthese")
    print("✓ Safe:", response1[:100])
except ValueError as e:
    print("✗ Blocked:", e)

# Unsafe prompt (PII)
try:
    response2 = guardrails.generate_safe("Mon email est john@example.com, aide-moi")
    print("✓ Safe:", response2[:100])
except ValueError as e:
    print("✗ Blocked:", e)
    # → Blocked: ['PII_DETECTED']

IA Responsable n'est pas optionnel. Configurez safety settings strictes en prod (BLOCK_LOW_AND_ABOVE), loggez tous les events safety pour audit. Implementez guardrails pre/post pour bloquer PII, prompt injection, keywords sensibles. Utilisez Gemma Scope pour interpreter decisions et detecter biais. Google AI Principles = framework solide, suivez-le.

Gouvernance des Modeles

⏱ 25 min Intermediaire

🎯 Objectifs d'apprentissage

Gerer model lifecycle (dev → staging → prod)
Implementer versioning et deprecation strategy
Migrer 2.0 → 2.5 en production
Documenter decisions avec ADR

🔄 Model Lifecycle Management

📝 Model Registry & Versioning

python

# model_registry.py
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
import json

class ModelStage(Enum):
    DEVELOPMENT = "dev"
    STAGING = "staging"
    PRODUCTION = "prod"
    DEPRECATED = "deprecated"

@dataclass
class ModelVersion:
    name: str  # e.g., "gemini-2.5-flash"
    version: str  # e.g., "v1.2.3"
    stage: ModelStage
    created_at: datetime
    promoted_at: datetime = None
    deprecated_at: datetime = None
    performance_metrics: dict = None
    notes: str = ""

class ModelRegistry:
    def __init__(self, registry_file: str = "model_registry.json"):
        self.registry_file = registry_file
        self.models = self._load_registry()

    def _load_registry(self) -> dict:
        """Load registry from file"""
        try:
            with open(self.registry_file, "r") as f:
                data = json.load(f)
                # Convert to ModelVersion objects
                models = {}
                for key, val in data.items():
                    val['stage'] = ModelStage(val['stage'])
                    val['created_at'] = datetime.fromisoformat(val['created_at'])
                    if val.get('promoted_at'):
                        val['promoted_at'] = datetime.fromisoformat(val['promoted_at'])
                    if val.get('deprecated_at'):
                        val['deprecated_at'] = datetime.fromisoformat(val['deprecated_at'])
                    models[key] = ModelVersion(**val)
                return models
        except FileNotFoundError:
            return {}

    def _save_registry(self):
        """Save registry to file"""
        data = {}
        for key, model in self.models.items():
            data[key] = {
                'name': model.name,
                'version': model.version,
                'stage': model.stage.value,
                'created_at': model.created_at.isoformat(),
                'promoted_at': model.promoted_at.isoformat() if model.promoted_at else None,
                'deprecated_at': model.deprecated_at.isoformat() if model.deprecated_at else None,
                'performance_metrics': model.performance_metrics,
                'notes': model.notes
            }
        with open(self.registry_file, "w") as f:
            json.dump(data, f, indent=2)

    def register_model(self, name: str, version: str, stage: ModelStage, notes: str = ""):
        """Register new model version"""
        key = f"{name}@{version}"
        self.models[key] = ModelVersion(
            name=name,
            version=version,
            stage=stage,
            created_at=datetime.now(),
            notes=notes
        )
        self._save_registry()
        print(f"✓ Registered {key} in {stage.value}")

    def promote_model(self, name: str, version: str, to_stage: ModelStage):
        """Promote model to next stage"""
        key = f"{name}@{version}"
        if key not in self.models:
            raise ValueError(f"Model {key} not found in registry")

        self.models[key].stage = to_stage
        self.models[key].promoted_at = datetime.now()
        self._save_registry()
        print(f"✓ Promoted {key} to {to_stage.value}")

    def deprecate_model(self, name: str, version: str, reason: str):
        """Deprecate model version"""
        key = f"{name}@{version}"
        if key not in self.models:
            raise ValueError(f"Model {key} not found in registry")

        self.models[key].stage = ModelStage.DEPRECATED
        self.models[key].deprecated_at = datetime.now()
        self.models[key].notes += f"\nDeprecated: {reason}"
        self._save_registry()
        print(f"✓ Deprecated {key}: {reason}")

    def get_active_model(self, name: str, stage: ModelStage) -> ModelVersion:
        """Get active model version for stage"""
        active_models = [
            model for model in self.models.values()
            if model.name == name and model.stage == stage
        ]

        if not active_models:
            raise ValueError(f"No active {name} model in {stage.value}")

        # Return most recent
        return sorted(active_models, key=lambda m: m.created_at, reverse=True)[0]

    def list_models(self, stage: ModelStage = None):
        """List all models, optionally filtered by stage"""
        models = self.models.values()
        if stage:
            models = [m for m in models if m.stage == stage]

        for model in sorted(models, key=lambda m: m.created_at, reverse=True):
            print(f"{model.name}@{model.version} | {model.stage.value} | {model.created_at.date()}")

# Usage
registry = ModelRegistry()

# Register new model in dev
registry.register_model(
    name="gemini-2.5-flash",
    version="v1.0.0",
    stage=ModelStage.DEVELOPMENT,
    notes="Initial deployment with context caching"
)

# After testing, promote to staging
registry.promote_model(
    name="gemini-2.5-flash",
    version="v1.0.0",
    to_stage=ModelStage.STAGING
)

# After staging validation, promote to prod
registry.promote_model(
    name="gemini-2.5-flash",
    version="v1.0.0",
    to_stage=ModelStage.PRODUCTION
)

# Deploy new version
registry.register_model(
    name="gemini-2.5-flash",
    version="v1.1.0",
    stage=ModelStage.DEVELOPMENT,
    notes="Added model routing"
)

# Deprecate old version
registry.deprecate_model(
    name="gemini-1.5-flash",
    version="v0.9.0",
    reason="Migrated to Gemini 2.5"
)

# List prod models
print("\n=== PRODUCTION MODELS ===")
registry.list_models(stage=ModelStage.PRODUCTION)

🔄 Migration Strategy: 2.0 → 2.5

python

class ModelMigration:
    """Gerer migration progressive entre versions"""

    def __init__(self, old_model: str, new_model: str):
        self.old_model = GenerativeModel(old_model)
        self.new_model = GenerativeModel(new_model)
        self.rollout_percentage = 0

    def set_rollout(self, percentage: int):
        """Set traffic split (0-100% vers new model)"""
        if not 0 <= percentage <= 100:
            raise ValueError("Percentage must be 0-100")
        self.rollout_percentage = percentage
        print(f"Rollout: {percentage}% → {self.new_model._model_name}")

    def generate_content(self, prompt: str):
        """Generate avec traffic splitting"""
        import random

        # Traffic split
        if random.randint(0, 99) < self.rollout_percentage:
            # Route to new model
            print(f"[Routing] → NEW: {self.new_model._model_name}")
            return self.new_model.generate_content(prompt)
        else:
            # Route to old model
            print(f"[Routing] → OLD: {self.old_model._model_name}")
            return self.old_model.generate_content(prompt)

# Migration progressive 2.0 → 2.5
migration = ModelMigration(
    old_model="gemini-2.0-flash-exp",
    new_model="gemini-2.5-flash"
)

# Week 1: 10% traffic vers 2.5
migration.set_rollout(10)
for i in range(10):
    migration.generate_content("Test query")
# → 1 requete vers 2.5, 9 vers 2.0

# Week 2: Monitor metrics, si OK → 50%
migration.set_rollout(50)

# Week 3: 100% vers 2.5
migration.set_rollout(100)

# Deprecate 2.0
registry.deprecate_model(
    name="gemini-2.0-flash-exp",
    version="v1.0.0",
    reason="Fully migrated to 2.5"
)

📋 Architecture Decision Records (ADR)

markdown

# ADR-001: Migration vers Gemini 2.5 Flash

## Status
ACCEPTED - 2026-02-01

## Context
Notre application chatbot support utilise Gemini 2.0 Flash Exp depuis 6 mois.
Gemini 2.5 Flash offre meilleures performances (+15% qualite) et meme prix.

## Decision
Migrer progressivement vers Gemini 2.5 Flash sur 3 semaines :
- Week 1 : 10% traffic (canary)
- Week 2 : 50% traffic (validation large scale)
- Week 3 : 100% traffic (full rollout)

## Consequences

### Positive
- +15% quality score (evaluation benchmark interne)
- Latence identique (~800ms p95)
- Cout identique ($0.15/1M input)
- Support long context (2M tokens vs 1M)

### Negative
- Risque regression qualite (mitigation : canary + rollback plan)
- Effort migration : 2 engineer-days

### Neutral
- API identique, pas de code changes

## Rollback Plan
Si quality score < baseline :
1. Rollback immediate vers 2.0 Flash
2. Root cause analysis
3. Re-evaluation decision

## Monitoring
- Quality score (target: >85%)
- Latency p50/p95 (target: <1000ms)
- Cost per conversation (target: <$0.001)
- Error rate (target: <1%)

## References
- Benchmark results: docs/benchmarks/2.0-vs-2.5.md
- Gemini 2.5 release notes: https://cloud.google.com/vertex-ai/docs/release-notes

💡 ADR Best Practices

1 ADR par decision majeure (model change, architecture change)
Template standardise : Status, Context, Decision, Consequences
Stocker dans Git (docs/adr/)
Review en equipe avant ACCEPTED

🔍 Model Deprecation Timeline

Phase	Duration	Actions
Annonce	T-90 days	Communication interne/externe, migration guide publie
Warning	T-60 days	Deprecation warnings dans logs, emails equipes
Migration	T-30 days	Support migration actif, office hours
Sunset	T-0	Model desactive, requetes rejettees avec error explicite

Gouvernance modeles = discipline essentielle en production. Utilisez model registry pour tracker versions actives par environnement. Migration progressive (10% → 50% → 100%) reduit risque. ADR documente WHY derriere chaque decision majeure (critical pour onboarding et audits). Deprecation avec 90 days notice = respect users.

Agent Governance

⏱ 25 min Intermediaire

🎯 Objectifs d'apprentissage

Implementer tool governance dans Agent Builder
Gerer permissions et access control
Auditer actions agents
Utiliser agent marketplace securise

🛡️ Tool Governance Framework

python

from vertexai.preview import reasoning_engines
from google.cloud import firestore
from datetime import datetime
from enum import Enum

class ToolRiskLevel(Enum):
    LOW = "low"  # Read-only, pas d'impact business
    MEDIUM = "medium"  # Modifications limitees
    HIGH = "high"  # Actions critiques (delete, paiements)
    CRITICAL = "critical"  # Actions irreversibles

class ToolGovernance:
    def __init__(self, firestore_db):
        self.db = firestore_db
        self.audit_collection = "agent_tool_audit"

    def register_tool(self, tool_name: str, risk_level: ToolRiskLevel,
                      requires_approval: bool = False):
        """Enregistrer tool avec niveau de risque"""

        tool_doc = {
            "name": tool_name,
            "risk_level": risk_level.value,
            "requires_approval": requires_approval,
            "registered_at": datetime.now(),
            "allowed_agents": [],  # Whitelist agents
        }

        self.db.collection("tool_registry").document(tool_name).set(tool_doc)
        print(f"✓ Tool registered: {tool_name} (risk: {risk_level.value})")

    def approve_tool_for_agent(self, tool_name: str, agent_id: str, approved_by: str):
        """Approuver tool pour agent specifique"""

        tool_ref = self.db.collection("tool_registry").document(tool_name)
        tool_doc = tool_ref.get()

        if not tool_doc.exists:
            raise ValueError(f"Tool {tool_name} not registered")

        # Ajouter agent a whitelist
        tool_ref.update({
            "allowed_agents": firestore.ArrayUnion([agent_id])
        })

        # Logger approval
        self._audit_log({
            "event": "TOOL_APPROVED",
            "tool": tool_name,
            "agent": agent_id,
            "approved_by": approved_by,
            "timestamp": datetime.now()
        })

        print(f"✓ Tool {tool_name} approved for agent {agent_id}")

    def check_tool_permission(self, tool_name: str, agent_id: str) -> bool:
        """Verifier si agent peut utiliser tool"""

        tool_doc = self.db.collection("tool_registry").document(tool_name).get()

        if not tool_doc.exists:
            return False

        tool_data = tool_doc.to_dict()

        # Check whitelist
        if agent_id not in tool_data.get("allowed_agents", []):
            return False

        return True

    def audit_tool_call(self, tool_name: str, agent_id: str, params: dict, result: dict):
        """Auditer appel tool"""

        self._audit_log({
            "event": "TOOL_CALLED",
            "tool": tool_name,
            "agent": agent_id,
            "params": params,
            "result": result,
            "timestamp": datetime.now()
        })

    def _audit_log(self, log_entry: dict):
        """Logger event audit dans Firestore"""
        self.db.collection(self.audit_collection).add(log_entry)

    def get_audit_trail(self, agent_id: str = None, tool_name: str = None, days: int = 30):
        """Recuperer audit trail"""

        query = self.db.collection(self.audit_collection)

        if agent_id:
            query = query.where("agent", "==", agent_id)
        if tool_name:
            query = query.where("tool", "==", tool_name)

        # Last N days
        cutoff = datetime.now() - timedelta(days=days)
        query = query.where("timestamp", ">=", cutoff)

        results = query.stream()

        print(f"=== AUDIT TRAIL (last {days} days) ===")
        for doc in results:
            data = doc.to_dict()
            print(f"{data['timestamp']}: {data['event']} - {data.get('tool', 'N/A')} by {data.get('agent', 'N/A')}")

# Setup governance
db = firestore.Client()
governance = ToolGovernance(db)

# Register tools avec risk levels
governance.register_tool(
    tool_name="search_knowledge_base",
    risk_level=ToolRiskLevel.LOW,
    requires_approval=False
)

governance.register_tool(
    tool_name="update_customer_record",
    risk_level=ToolRiskLevel.MEDIUM,
    requires_approval=True
)

governance.register_tool(
    tool_name="process_refund",
    risk_level=ToolRiskLevel.HIGH,
    requires_approval=True
)

governance.register_tool(
    tool_name="delete_account",
    risk_level=ToolRiskLevel.CRITICAL,
    requires_approval=True
)

# Approuver tools pour agent support
governance.approve_tool_for_agent(
    tool_name="search_knowledge_base",
    agent_id="agent-support-001",
    approved_by="admin@company.com"
)

governance.approve_tool_for_agent(
    tool_name="update_customer_record",
    agent_id="agent-support-001",
    approved_by="admin@company.com"
)

# Agent finance peut process_refund
governance.approve_tool_for_agent(
    tool_name="process_refund",
    agent_id="agent-finance-001",
    approved_by="finance-manager@company.com"
)

# Verifier permissions
can_search = governance.check_tool_permission("search_knowledge_base", "agent-support-001")
print(f"Agent support can search KB: {can_search}")  # True

can_delete = governance.check_tool_permission("delete_account", "agent-support-001")
print(f"Agent support can delete account: {can_delete}")  # False

# Audit trail
governance.get_audit_trail(agent_id="agent-support-001", days=7)

🔐 Agent Permission Model

Agent Type	Allowed Tools	Risk Level	Approval Required
Customer Support	Search KB, View orders, Update contact info	LOW-MEDIUM	Manager approval
Sales	CRM lookup, Create quote, Schedule demo	LOW-MEDIUM	Sales manager approval
Finance	Process refund, Generate invoice, View transactions	MEDIUM-HIGH	Finance manager approval
Admin	All tools including delete, modify settings	HIGH-CRITICAL	C-level approval

📊 Agent Marketplace Governance

python

class AgentMarketplace:
    """Marketplace interne pour partager agents securises"""

    def __init__(self, firestore_db):
        self.db = firestore_db

    def publish_agent(self, agent_config: dict, publisher: str):
        """Publier agent dans marketplace"""

        # Validation security
        self._validate_agent_security(agent_config)

        agent_doc = {
            **agent_config,
            "publisher": publisher,
            "published_at": datetime.now(),
            "status": "pending_review",  # Require review avant usage
            "downloads": 0,
            "ratings": []
        }

        agent_id = self.db.collection("agent_marketplace").add(agent_doc)[1].id
        print(f"✓ Agent published for review: {agent_id}")

        return agent_id

    def _validate_agent_security(self, agent_config: dict):
        """Valider securite agent avant publication"""

        # Check 1: Pas de hardcoded secrets
        if "api_key" in str(agent_config).lower():
            raise ValueError("Agent contains hardcoded API keys")

        # Check 2: Tools approuves uniquement
        tools = agent_config.get("tools", [])
        for tool in tools:
            tool_doc = self.db.collection("tool_registry").document(tool).get()
            if not tool_doc.exists:
                raise ValueError(f"Tool {tool} not approved in registry")

        # Check 3: System instruction pas malicieux
        system_instruction = agent_config.get("system_instruction", "")
        malicious_keywords = ["ignore", "disregard", "bypass"]
        if any(kw in system_instruction.lower() for kw in malicious_keywords):
            raise ValueError("System instruction contains suspicious keywords")

    def approve_agent(self, agent_id: str, reviewer: str):
        """Approuver agent apres review"""

        agent_ref = self.db.collection("agent_marketplace").document(agent_id)
        agent_ref.update({
            "status": "approved",
            "reviewed_by": reviewer,
            "reviewed_at": datetime.now()
        })

        print(f"✓ Agent {agent_id} approved by {reviewer}")

    def install_agent(self, agent_id: str, user: str):
        """Installer agent depuis marketplace"""

        agent_doc = self.db.collection("agent_marketplace").document(agent_id).get()

        if not agent_doc.exists:
            raise ValueError(f"Agent {agent_id} not found")

        agent_data = agent_doc.to_dict()

        if agent_data["status"] != "approved":
            raise ValueError(f"Agent not approved for installation")

        # Increment download counter
        self.db.collection("agent_marketplace").document(agent_id).update({
            "downloads": firestore.Increment(1)
        })

        # Logger installation
        self.db.collection("agent_installs").add({
            "agent_id": agent_id,
            "user": user,
            "installed_at": datetime.now()
        })

        print(f"✓ Agent {agent_id} installed for {user}")

        return agent_data

# Setup marketplace
marketplace = AgentMarketplace(db)

# Publier agent customer support
support_agent_config = {
    "name": "Customer Support Agent v2",
    "description": "Agent support avec acces KB et CRM",
    "model": "gemini-2.5-flash",
    "tools": ["search_knowledge_base", "update_customer_record"],
    "system_instruction": "Tu es un assistant support...",
}

agent_id = marketplace.publish_agent(support_agent_config, publisher="team-support@company.com")

# Review & approve
marketplace.approve_agent(agent_id, reviewer="security@company.com")

# Installer pour autre equipe
marketplace.install_agent(agent_id, user="team-sales@company.com")

📝 Audit Dashboard

python

def generate_agent_audit_report(governance: ToolGovernance, days: int = 30):
    """Generate rapport audit agent activities"""

    query = governance.db.collection(governance.audit_collection)
    cutoff = datetime.now() - timedelta(days=days)
    query = query.where("timestamp", ">=", cutoff)

    events = [doc.to_dict() for doc in query.stream()]

    # Statistiques
    total_calls = len(events)
    unique_agents = len(set(e.get("agent") for e in events))
    unique_tools = len(set(e.get("tool") for e in events))

    # Top tools
    tool_counts = {}
    for event in events:
        tool = event.get("tool")
        if tool:
            tool_counts[tool] = tool_counts.get(tool, 0) + 1

    print(f"=== AGENT AUDIT REPORT (Last {days} days) ===\n")
    print(f"Total tool calls: {total_calls}")
    print(f"Active agents: {unique_agents}")
    print(f"Tools used: {unique_tools}\n")

    print("Top 5 tools:")
    for tool, count in sorted(tool_counts.items(), key=lambda x: x[1], reverse=True)[:5]:
        print(f"  {tool}: {count} calls")

    # Detect anomalies (high-risk tool usage)
    high_risk_calls = [
        e for e in events
        if e.get("tool") in ["process_refund", "delete_account"]
    ]

    if high_risk_calls:
        print(f"\n⚠️  {len(high_risk_calls)} high-risk tool calls detected:")
        for call in high_risk_calls[:10]:
            print(f"  {call['timestamp']}: {call['tool']} by {call['agent']}")

# Generate rapport mensuel
generate_agent_audit_report(governance, days=30)

Agent governance protege votre entreprise. Tool registry avec risk levels = control granulaire. Permission whitelist = least privilege principle. Audit trail complet = compliance & forensics. Marketplace interne = reutilisation securisee agents. En prod, ajoutez : rate limiting par agent, anomaly detection (calls inhabituels), quarterly access review.

Gemma & Open Source

⏱ 25 min Intermediaire

🎯 Objectifs d'apprentissage

Comprendre Gemma 3/3n et use cases
Deployer Gemma on-device (Nano)
Fine-tuner Gemma pour domaine specifique
Utiliser Gemma Scope 2 pour interpretabilite

🌟 Famille Gemma (2026)

Modele	Taille	Use Case	Deployment
Gemma 3 27B	27B params	Self-hosting, fine-tuning custom	GKE, on-prem, cloud VM
Gemma 3 9B	9B params	Edge servers, latency-critical	Edge TPU, GPU servers
Gemma 3 2B	2B params	Mobile apps, IoT devices	Android, iOS, Raspberry Pi
Gemma Nano	1.8B params	On-device inference (offline)	Smartphones, laptops

💡 Gemma 3n = Nano-optimized

Gemma 3n est version quantized 4-bit pour deployment on-device. Inference 3-5x plus rapide que Gemma 3 standard, avec qualite similaire.

📱 Deploy Gemma Nano On-Device

python

# Installation
# pip install mediapipe

import mediapipe as mp
from mediapipe.tasks import python
from mediapipe.tasks.python import text

# Download Gemma Nano model (1.8B, quantized 4-bit)
# https://ai.google.dev/gemma/docs/get_started

# Initialize Gemma Nano
base_options = python.BaseOptions(model_asset_path='gemma_nano_2b_quantized.bin')
options = text.TextGeneratorOptions(base_options=base_options, max_tokens=256)
generator = text.TextGenerator.create_from_options(options)

# Generate on-device (offline)
prompt = "Explique la photosynthese en 2 phrases"
result = generator.generate(prompt)

print(result.text)
# → Inference 100% locale, pas besoin internet
# → Latence ~500ms sur smartphone recent
# → Privacy total (donnees ne quittent pas device)

# Use cases on-device :
# - Keyboard suggestions
# - Voice assistant offline
# - Document summarization (emails, PDFs)
# - Privacy-sensitive apps (medical, finance)

🎓 Fine-Tuning Gemma

python

# Fine-tune Gemma 3 9B pour domaine medical

from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer
from datasets import load_dataset
import torch

# 1. Charger Gemma 3 9B
model_name = "google/gemma-3-9b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# 2. Preparer dataset medical (exemple)
# Format : {"prompt": "...", "completion": "..."}
dataset = load_dataset("medical-qa-dataset")  # Votre dataset

def preprocess_function(examples):
    inputs = [f"Question: {q}\nAnswer:" for q in examples["prompt"]]
    targets = examples["completion"]

    model_inputs = tokenizer(inputs, max_length=512, truncation=True, padding="max_length")
    labels = tokenizer(targets, max_length=256, truncation=True, padding="max_length")

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_dataset = dataset.map(preprocess_function, batched=True)

# 3. Configurer fine-tuning
training_args = TrainingArguments(
    output_dir="./gemma-medical-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    learning_rate=2e-5,
    warmup_steps=100,
    logging_steps=10,
    save_steps=500,
    evaluation_strategy="steps",
    eval_steps=500,
    fp16=True,  # Mixed precision
    push_to_hub=False
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
)

# 4. Fine-tune (4-8h sur 4x A100)
trainer.train()

# 5. Sauvegarder modele fine-tuned
model.save_pretrained("./gemma-medical-finetuned")
tokenizer.save_pretrained("./gemma-medical-finetuned")

# 6. Inference avec modele fine-tuned
finetuned_model = AutoModelForCausalLM.from_pretrained("./gemma-medical-finetuned")
prompt = "Question: Quels sont les symptomes du diabete de type 2 ?\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = finetuned_model.generate(**inputs, max_length=300)
print(tokenizer.decode(outputs[0]))

# ECONOMIE vs Gemini API :
# Fine-tuning : $500-2000 one-time (compute)
# Self-hosting : $100-500/month (VM/GPU)
# vs Gemini API : $500-5000/month si volume eleve
# → ROI positif si >10M tokens/month

🔍 Gemma Scope 2 : Interpretabilite

python

# pip install gemma-scope

from gemma_scope import GemmaScope, FeatureVisualizer

# Load Gemma 3 avec Scope
scope = GemmaScope(
    model_name="gemma-3-9b",
    sae_layer=15  # Sparse AutoEncoder layer 15
)

# Analyser activation pour prompt
prompt = "Le vaccin COVID cause l'autisme"  # Misinformation

activations = scope.get_activations(prompt)
top_features = scope.get_top_features(activations, k=20)

print("=== TOP ACTIVATED FEATURES ===")
for feature_id, strength in top_features:
    desc = scope.get_feature_description(feature_id)
    print(f"Feature {feature_id}: {desc} ({strength:.3f})")

# SORTIE EXEMPLE :
# Feature 1892: Medical misinformation (0.912) ⚠️
# Feature 3405: Vaccine-related content (0.854)
# Feature 8721: Controversial claims (0.743)
# → Model detecte misinformation !

# Visualiser features actives
visualizer = FeatureVisualizer(scope)
visualizer.plot_feature_activation(prompt, top_k=10)
visualizer.save("feature_activation.png")

# Use cases Gemma Scope :
# 1. Detecter biais dans reponses
# 2. Expliquer pourquoi modele genere telle reponse
# 3. Identifier features problematiques pour fine-tuning
# 4. Audit & compliance (expliquer decisions IA)

🛡️ Gemma & AI Safety

🔒 Gemma Safety Features

Safety filters : Pre-trained pour bloquer harmful content
Open weights : Auditabilite complete du modele
Responsible AI Toolkit : Outils pour evaluer biais, toxicite
Gemma Scope : Interpretabilite via sparse autoencoders

python

# Evaluer toxicite avec Gemma
from transformers import pipeline

# Charger Gemma 3 2B pour classification
classifier = pipeline(
    "text-classification",
    model="google/gemma-3-2b-toxicity-classifier"
)

# Tester prompts
prompts = [
    "Comment installer Python ?",
    "Je deteste tous les [groupe]",  # Toxic
    "Explique la photosynthese"
]

for prompt in prompts:
    result = classifier(prompt)[0]
    label = result['label']  # TOXIC ou NON_TOXIC
    score = result['score']

    print(f"Prompt: {prompt[:50]}")
    print(f"  → {label} (confidence: {score:.2f})\n")

# Integration dans pipeline production
def safe_generate(prompt: str, model):
    """Generate avec toxicity check"""

    # Pre-check
    toxicity = classifier(prompt)[0]
    if toxicity['label'] == 'TOXIC' and toxicity['score'] > 0.8:
        return "Je ne peux pas repondre a cette requete."

    # Generate
    response = model.generate(prompt)

    # Post-check
    response_toxicity = classifier(response)[0]
    if response_toxicity['label'] == 'TOXIC':
        return "Reponse filtree pour contenu inapproprie."

    return response

🌐 Gemma Ecosystem

Tool	Purpose	Link
Gemma.cpp	Inference C++ optimise (CPU)	github.com/google/gemma.cpp
Gemma Android	SDK Android pour on-device	ai.google.dev/gemma/docs/android
Gemma Scope	Interpretabilite SAE	github.com/google-research/gemma-scope
Gemma Safety	Toxicity/bias evaluation	github.com/google/responsible-ai
Kaggle Models	Download weights (free)	kaggle.com/models/google/gemma

Gemma = alternative open-source a Gemini pour use cases self-hosted. Gemma 3 27B competitive avec modeles propietaires pour domaines specifiques apres fine-tuning. Gemma Nano revolutionne on-device AI (privacy, offline). Fine-tuning ROI positif si >10M tokens/mois. Gemma Scope = interpretabilite unique dans industrie. Utilisez Gemma pour : data sensible (medical, finance), offline apps, cost optimization.

Ecosysteme Google AI

⏱ 20 min Apercu

🎯 Objectifs d'apprentissage

Explorer NotebookLM et Workspace AI
Comprendre Code Assist et Astra DB
Decouvrir Mariner, Jules et AI Overviews
Integrer ecosysteme Google AI

🌐 Carte Ecosysteme Google AI

📚 NotebookLM : AI Research Assistant

NotebookLM transforme vos documents en assistant IA interactif.

python

# NotebookLM via API (preview)
# pip install google-notebooklm

from google.notebooklm import NotebookLM

# Creer notebook
notebook = NotebookLM.create(name="Product Documentation")

# Upload sources (PDFs, docs, URLs)
notebook.add_source(file="product_manual.pdf")
notebook.add_source(file="api_docs.md")
notebook.add_source(url="https://docs.product.com/guide")

# Query avec context automatique
response = notebook.query(
    "Comment configurer authentication OAuth ?"
)

print(response.answer)
# → Reponse synthetisee depuis les 3 sources
# → Citations automatiques vers sources

print("\nSources :")
for citation in response.citations:
    print(f"- {citation.source}: {citation.excerpt}")

# Use cases :
# - Onboarding nouveaux employes (docs entreprise)
# - Support client (knowledge base)
# - Research (papers scientifiques)
# - Audit & compliance (reglementations)

💼 Workspace AI : Gmail, Docs, Sheets

Product	AI Feature	Example
Gmail	Help me write (email drafting)	"Draft email to decline meeting professionally"
Docs	Help me write (content generation)	"Write product launch announcement, tone: excited"
Sheets	Help me organize (data analysis)	"Create pivot table summarizing sales by region"
Slides	Create presentation	"Create 10-slide deck on Q4 results with charts"
Meet	Real-time transcription	Auto-generate meeting notes with action items

💡 Workspace AI Enterprise

Workspace AI utilise Gemini 1.5 Pro par default. Donnees ne sont PAS utilisees pour training. Available avec Workspace Enterprise Plus ($30/user/month).

💻 Code Assist : AI Coding

python

# Cloud Code Assist = Gemini Code Assist pour GCP
# Integration IDE : VS Code, IntelliJ, Cloud Shell Editor

# Exemples use cases :

# 1. Code generation
# Prompt: "Create Cloud Function to resize images uploaded to GCS"
# → Generates complete Python code with error handling

# 2. Code explanation
# Select complex code block → "Explain this code"
# → Natural language explanation ligne par ligne

# 3. Code migration
# "Convert this App Engine app to Cloud Run"
# → Generates Dockerfile, deployment config, migration guide

# 4. Debugging
# Paste error → "How to fix this error?"
# → Root cause analysis + solution

# 5. Security review
# "Check this code for security vulnerabilities"
# → Identifies SQL injection, XSS, secrets in code

# Pricing :
# Code Assist : $19/user/month
# Alternative : GitHub Copilot ($10/month), Claude Code (free beta)

🗄️ Astra DB : Vector Database

Astra DB (DataStax) = managed vector database optimise pour RAG avec Gemini.

python

# pip install astrapy

from astrapy.client import DataAPIClient
from vertexai.language_models import TextEmbeddingModel

# Connect to Astra DB
client = DataAPIClient(token="AstraCS:xxx")
database = client.get_database("https://xxx.apps.astra.datastax.com")
collection = database.get_collection("documents")

# Embed documents avec Gemini
embedding_model = TextEmbeddingModel.from_pretrained("text-embedding-004")

documents = [
    "Gemini 2.5 Pro released February 2026",
    "Context caching reduces cost by 90%",
    "Flash-8B is 75x cheaper than Pro"
]

for doc in documents:
    # Generate embedding
    embedding = embedding_model.get_embeddings([doc])[0].values

    # Insert dans Astra
    collection.insert_one({
        "text": doc,
        "embedding": embedding
    })

# Vector search
query = "How to reduce Gemini costs?"
query_embedding = embedding_model.get_embeddings([query])[0].values

results = collection.vector_find(
    vector=query_embedding,
    limit=3
)

for result in results:
    print(f"Score: {result['$similarity']:.3f} - {result['text']}")

# Astra advantages :
# - Latency <10ms (global distribution)
# - Auto-scaling (serverless)
# - Integrated with Langchain, LlamaIndex
# - Free tier : 80GB storage

🌊 Mariner : Web Agent

Mariner = agent Gemini qui navigue web automatiquement.

python

# Mariner (preview, disponible via Chrome extension)

# Use cases :
# 1. Research automation
#    "Find 10 competitors in AI coding assistants space with pricing"
#    → Mariner visite sites, extrait pricing, genere tableau

# 2. E-commerce
#    "Compare prices for iPhone 15 Pro on Amazon, BestBuy, Target"
#    → Mariner navigue sites, compare prix en temps reel

# 3. Travel booking
#    "Find cheapest flight Paris to NYC, March 15-22"
#    → Mariner compare Google Flights, Kayak, Expedia

# 4. Data collection
#    "Scrape product reviews from top 50 items on category page"
#    → Mariner navigue pages, extrait reviews, structure data

# Architecture :
# User query → Gemini 2.5 Pro → Plans actions → Mariner agent
#   → Executes browser actions (click, scroll, extract)
#   → Returns structured results

# Privacy : Mariner runs locally in browser, pas de data sent to Google

👨‍💻 Jules : AI Code Agent

Jules = agent autonome pour fix bugs et implement features.

bash

# Jules integration GitHub (preview)

# Workflow :
# 1. Create GitHub issue : "Fix: API returns 500 on invalid input"
# 2. Assign to @jules-ai
# 3. Jules :
#    - Reads issue description
#    - Analyzes codebase
#    - Identifies root cause
#    - Fixes bug
#    - Writes tests
#    - Creates PR with explanation
# 4. Human review → Merge

# Example issue :
# Title: "Add caching to reduce Gemini API costs"
# Description: "Implement Redis cache for repeated queries"

# Jules actions :
# - Reads current code
# - Installs Redis client
# - Implements cache layer avec TTL
# - Adds monitoring metrics
# - Creates PR avec benchmark results

# Similar tools : Devin, Cursor Agent, GitHub Copilot Workspace

🔍 AI Overviews : Search with Gemini

AI Overviews = Google Search integre Gemini pour reponses directes.

python

# AI Overviews API (preview)
# pip install google-search-ai

from google.search import AIOverviewsClient

client = AIOverviewsClient()

# Query avec AI-generated overview
query = "How to reduce Gemini API costs in production?"

result = client.search(query)

# Overview = Gemini-generated summary
print("=== AI OVERVIEW ===")
print(result.overview.text)

# Traditional search results
print("\n=== SOURCES ===")
for source in result.sources:
    print(f"- {source.title}: {source.url}")

# SORTIE EXEMPLE :
# === AI OVERVIEW ===
# To reduce Gemini API costs in production:
# 1. Use context caching for repeated content (-90% cost)
# 2. Route simple queries to Flash-8B instead of Pro (-75x cost)
# 3. Use Batch API for non-urgent workloads (-50% cost)
# 4. Compress prompts and control max_output_tokens
#
# === SOURCES ===
# - Vertex AI Pricing: https://cloud.google.com/vertex-ai/pricing
# - Context Caching Guide: https://...
# - Best Practices: https://...

# Use case : Integrate AI Overviews dans apps pour rich answers

Ecosysteme Google AI est vaste et en expansion rapide. NotebookLM revolutionne research, Workspace AI boost productivite quotidienne. Code Assist accelere dev, Astra DB optimise RAG. Mariner automatise web browsing, Jules fix bugs autonomously. AI Overviews transforme search. En 2026, convergence Gemini + tools Google = productivity multiplier. Explorez, experimentez, integrez dans workflows.

Tendances & Futur de l'IA

⏱ 20 min Vision

🎯 Objectifs d'apprentissage

Comprendre Universal Agent vision
Explorer Generative UI paradigm
Anticiper Personal Intelligence evolution
Preparer architecture pour futur multimodal

🌟 Universal Agent : One Agent to Rule Them All

Vision 2027-2030 : Agent unique capable d'accomplir toute tache digitale.

🚀 Universal Agent Capabilities (2027+)

Autonomy : Complete tasks end-to-end sans intervention humaine
Context retention : Memoire long-term de toutes interactions
Multi-tool orchestration : Utilise 100+ tools selon besoin
Learning : Apprend de chaque interaction, s'ameliore
Personalization : Adapte comportement a chaque user

🎨 Generative UI : UI qui S'Adapte

Paradigme shift : UI n'est plus statique, elle est generee dynamiquement par IA.

python

# Generative UI avec Gemini (concept 2026)

from vertexai.generative_models import GenerativeModel

model = GenerativeModel("gemini-2.5-pro")

# User request
user_request = "Je veux dashboard pour suivre mes couts Vertex AI"

# Generate UI dynamiquement
ui_generation_prompt = f"""
Generate React component code for this user request: "{user_request}"

Requirements:
- Use Recharts for visualizations
- Fetch data from /api/vertex-costs endpoint
- Responsive design with Tailwind
- Include filters: date range, model type
- Show total cost, cost by model (pie chart), daily trend (line chart)

Return ONLY valid React JSX code.
"""

response = model.generate_content(ui_generation_prompt)
react_code = response.text

# Save generated component
with open("CostDashboard.jsx", "w") as f:
    f.write(react_code)

print("✓ UI component generated!")

# Deploy automatiquement
# import subprocess
# subprocess.run(["npm", "run", "build"])
# subprocess.run(["gcloud", "run", "deploy", "cost-dashboard", ...])

# RESULTAT : Dashboard custom genere en <5 secondes
# → Pas besoin developer, designer
# → UI parfaitement adaptee a user request
# → Iterations rapides : "Ajoute export CSV" → regenere component

🧠 Personal Intelligence : AI qui Vous Connait

Personal Intelligence = Agent IA avec memoire complete de votre vie digitale.

python

# Personal Intelligence architecture (conceptuel)

class PersonalIntelligence:
    """Agent IA avec memoire long-term et personnalisation"""

    def __init__(self, user_id: str):
        self.user_id = user_id
        self.memory = self._load_memory()  # All interactions history
        self.preferences = self._load_preferences()
        self.context = self._load_context()  # Calendar, emails, docs

    def process_request(self, request: str):
        """Process request avec contexte personnel complet"""

        # Enrichir request avec contexte
        enriched_prompt = f"""
User: {self.user_id}
Request: {request}

Personal context:
- Preferences: {self.preferences}
- Recent interactions: {self.memory[-10:]}
- Current calendar: {self.context['calendar_today']}
- Work projects: {self.context['active_projects']}

Generate personalized response considering all context.
"""

        response = self.model.generate_content(enriched_prompt)

        # Save interaction to memory
        self.memory.append({
            "timestamp": datetime.now(),
            "request": request,
            "response": response.text
        })
        self._save_memory()

        return response.text

# Examples use cases :

# 1. "Schedule meeting with Sarah"
# → Agent knows Sarah's email, checks both calendars, proposes 3 slots

# 2. "Summarize what I missed this morning"
# → Agent reads emails, Slack, calendar, genere summary personnalise

# 3. "Draft response to client email"
# → Agent knows client history, your writing style, projet context

# 4. "Should I approve this expense?"
# → Agent knows budget, spending patterns, company policies

# Privacy considerations :
# - All data encrypted at rest
# - User control over what data is accessible
# - Opt-in pour chaque data source
# - Delete memory on demand

📱 Multimodal Native : Beyond Text

Futur : IA comprehend simultanément text, image, audio, video, code, 3D.

python

# Multimodal use case futuriste

# Input : Voice + Screen share + Camera
# "Aide-moi a debugger cette app, voici mon ecran et le code"

from vertexai.generative_models import GenerativeModel, Part

model = GenerativeModel("gemini-3.0-ultra")  # Hypothetical 2027 model

# Multimodal input
response = model.generate_content([
    Part.from_audio_file("voice_explanation.wav"),  # Voice explanation
    Part.from_image_file("screenshot_error.png"),   # Screenshot with error
    Part.from_video_file("screen_recording.mp4"),   # Screen recording
    Part.from_text(open("app.py").read()),          # Source code
    "Debug this application and suggest fixes"
])

# Output : Multimodal response
# - Text explanation of bug
# - Code diff with fixes
# - Video tutorial showing how to fix
# - Voice explanation of root cause

print(response.text)  # Textual explanation

# Access other modalities
if response.video:
    response.video.save("fix_tutorial.mp4")

if response.audio:
    response.audio.save("explanation.mp3")

# Use case futur : "Design a logo for my company"
# → Input : voice description + mood board images
# → Output : 5 logo variations (SVG + PNG) + usage guidelines PDF

🔮 On-Device AI : Privacy First

Tendance 2026-2030 : Modeles puissants executent localement sur devices.

📱 On-Device AI Benefits

Privacy : Donnees ne quittent jamais device
Latency : Inference instant (<100ms)
Offline : Fonctionne sans internet
Cost : Pas de frais API
Scale : Millions users sans infra backend

🌐 Tendances 2026-2030

Tendance	Timeline	Impact
Universal Agent	2027-2028	1 agent remplace 100+ apps specialisees
Generative UI	2026-2027	Developpeurs frontend reduits de 50%
Personal Intelligence	2027-2029	Productivite personnelle +30-50%
Multimodal natif	2026-2027	Text-only devient obsolete
On-device AI	2026-2028	Cloud AI complementaire, pas primaire
AI-first OS	2028-2030	OS traditionnels remplaces par AI OS

Futur IA est multimodal, autonome, personnel, on-device. Universal Agent remplacera apps specialisees. Generative UI eliminera besoin UI designers pour use cases standards. Personal Intelligence deviendra extension naturelle cognition humaine. Preparez architectures pour ce futur : APIs modulaires, data ownership user-centric, privacy by design. 2026 = debut transformation, 2030 = monde different.

Projet Final : Architecture Enterprise Complete

⏱ 4-6 heures Projet

🎯 Objectif du Projet

Concevoir et documenter une architecture Gemini enterprise complete pour un cas d'usage reel, integrant tous les concepts de Phase 4.

📋 Cahier des Charges

🏢 Scenario : E-Commerce Customer Service Platform

Entreprise : TechMart, e-commerce 50M users, 10M transactions/mois

Besoin : Platform support client AI-powered avec agents autonomes

Contraintes :

Budget : $10,000/mois pour Vertex AI
SLA : 99.9% uptime, <2s latency p95
Compliance : GDPR, PCI-DSS
Scale : Support 100,000 conversations/jour
Langues : FR, EN, ES, DE

📐 Livrables Requis

1. Architecture Diagram (30 min)

Creer diagram architecture complete incluant :

Multi-model routing (Pro/Flash/Flash-8B)
RAG avec vector database
Agent system avec tools
Cache strategy
Monitoring & alerting
Security layers (VPC-SC, DLP, IAM)

yaml

# architecture.yaml - Exemple structure

components:
  frontend:
    type: Cloud Run
    replicas: 3-10 (autoscaling)
    regions: [us-central1, europe-west1]

  model_router:
    type: Cloud Run
    logic: |
      - Simple queries (FAQ) → Flash-8B
      - Standard (order status) → Flash
      - Complex (complaints) → Pro
    fallback: Pro

  rag_system:
    vector_db: Vertex AI Vector Search
    embeddings: text-embedding-004
    chunk_size: 512 tokens
    top_k: 5

  agent_system:
    framework: Vertex AI Agent Builder
    tools:
      - search_knowledge_base (LOW risk)
      - lookup_order (MEDIUM risk)
      - process_refund (HIGH risk)
      - update_customer_info (MEDIUM risk)
    governance: Tool approval workflow

  caching:
    explicit: System instructions (5000 tokens, TTL 60min)
    implicit: Auto-caching prefixes >1024 tokens

  monitoring:
    metrics:
      - Request count by model
      - Latency p50/p95/p99
      - Cost per conversation
      - Error rate
      - Safety filter triggers
    dashboards: Looker Studio + BigQuery
    alerts:
      - Budget >90% → Email + PagerDuty
      - Error rate >2% → PagerDuty
      - Latency p95 >3s → Slack

  security:
    vpc_sc: Perimeter around Vertex AI
    dlp: Scan prompts for PII before sending
    iam:
      - agents: roles/aiplatform.user
      - developers: roles/aiplatform.admin
    cmek: Customer-managed keys for data
    audit: Data Access logs enabled

2. Implementation Plan (45 min)

Document plan implementation detaille :

markdown

# Implementation Plan

## Phase 1 : Foundation (Week 1-2)
- [ ] Setup GCP project avec VPC-SC
- [ ] Configure IAM roles et service accounts
- [ ] Deploy base infrastructure (Cloud Run, Firestore)
- [ ] Implement model router avec Flash-8B/Flash/Pro
- [ ] Setup monitoring dashboard (BigQuery + Looker)

## Phase 2 : RAG System (Week 3-4)
- [ ] Ingest knowledge base (product docs, FAQs)
- [ ] Setup Vertex AI Vector Search
- [ ] Implement chunking strategy (512 tokens)
- [ ] Test retrieval quality (measure precision@5)
- [ ] Optimize embeddings model

## Phase 3 : Agent System (Week 5-6)
- [ ] Register tools dans tool registry
- [ ] Implement tool approval workflow
- [ ] Deploy agents avec Vertex AI Agent Builder
- [ ] Configure agent permissions
- [ ] Test agent workflows end-to-end

## Phase 4 : Optimization (Week 7-8)
- [ ] Implement context caching (system instructions)
- [ ] Configure batch processing for analytics
- [ ] Optimize prompts (-30% tokens)
- [ ] Setup cost attribution labels
- [ ] Run load tests (100K requests/day)

## Phase 5 : Security & Compliance (Week 9-10)
- [ ] Enable DLP for PII detection
- [ ] Configure safety settings (BLOCK_LOW_AND_ABOVE)
- [ ] Implement audit logging
- [ ] GDPR compliance review
- [ ] Security penetration testing

## Phase 6 : Production Deploy (Week 11-12)
- [ ] Canary deploy (10% traffic)
- [ ] Monitor metrics for 3 days
- [ ] Rollout to 50%
- [ ] Full production (100%)
- [ ] Post-deploy monitoring 2 weeks

3. Cost Model (45 min)

Calculer couts detailles :

python

# cost_model.py

class CostModel:
    def __init__(self):
        # Pricing ($/1M tokens)
        self.prices = {
            "flash-8b": {"input": 0.04, "output": 0.16},
            "flash": {"input": 0.15, "output": 0.60},
            "pro": {"input": 3.00, "output": 12.00},
            "cache": 0.015,
            "embedding": 0.025,
        }

    def calculate_monthly_cost(self,
                                conversations_per_day: int,
                                avg_messages_per_conversation: int):
        """Calculer cout mensuel"""

        total_conversations = conversations_per_day * 30

        # Model distribution (apres routing)
        flash_8b_pct = 0.60  # 60% simple queries
        flash_pct = 0.30     # 30% standard
        pro_pct = 0.10       # 10% complex

        # Tokens par message
        system_instruction_tokens = 5000  # Cached
        user_input_tokens = 300
        rag_context_tokens = 2000
        output_tokens = 150

        # Total messages
        total_messages = total_conversations * avg_messages_per_conversation

        # Cost breakdown
        costs = {}

        # 1. System instruction (cached)
        cache_cost = (system_instruction_tokens * total_messages / 1_000_000) * self.prices["cache"]
        costs["cache"] = cache_cost

        # 2. Embeddings (RAG)
        embedding_cost = (user_input_tokens * total_messages / 1_000_000) * self.prices["embedding"]
        costs["embeddings"] = embedding_cost

        # 3. LLM calls
        for model, pct in [("flash-8b", flash_8b_pct), ("flash", flash_pct), ("pro", pro_pct)]:
            model_messages = total_messages * pct
            input_tokens = user_input_tokens + rag_context_tokens

            input_cost = (input_tokens * model_messages / 1_000_000) * self.prices[model]["input"]
            output_cost = (output_tokens * model_messages / 1_000_000) * self.prices[model]["output"]

            costs[f"{model}_input"] = input_cost
            costs[f"{model}_output"] = output_cost

        total_cost = sum(costs.values())

        return {
            "total_monthly": total_cost,
            "cost_per_conversation": total_cost / total_conversations,
            "breakdown": costs
        }

# Calculate pour TechMart
model = CostModel()
result = model.calculate_monthly_cost(
    conversations_per_day=100_000,
    avg_messages_per_conversation=5
)

print(f"=== COST MODEL ===")
print(f"Total monthly: ${result['total_monthly']:.2f}")
print(f"Cost per conversation: ${result['cost_per_conversation']:.4f}")
print(f"\nBreakdown:")
for item, cost in result['breakdown'].items():
    print(f"  {item}: ${cost:.2f}")

# Expected output :
# Total monthly: ~$8,500
# Cost per conversation: ~$0.0028
# → Dans budget $10,000/mois

4. ADR Documentation (30 min)

Ecrire 3 ADRs pour decisions cles :

markdown

# ADR-001: Multi-Model Routing Strategy

## Status
ACCEPTED

## Context
TechMart needs to support 100K conversations/day within $10K/month budget.
Using only Pro would cost ~$45K/month. Using only Flash-8B would degrade quality.

## Decision
Implement intelligent model routing:
- Flash-8B (60% traffic): FAQ, simple queries
- Flash (30% traffic): Order status, standard support
- Pro (10% traffic): Complex complaints, escalations

Classifier: Flash-8B with 100-token prompt.

## Consequences
### Positive
- Cost reduced from $45K to $8.5K/month (-81%)
- Quality maintained (85% CSAT vs 87% all-Pro)
- Classifier cost negligible ($50/month)

### Negative
- Added complexity (router service)
- Potential misclassification (~5% rate)

### Mitigation
- Monitor classification accuracy
- Fallback to Pro on errors
- Weekly review misclassified queries

---

# ADR-002: Context Caching for System Instructions

## Status
ACCEPTED

## Context
System instruction contains 5000 tokens (product catalog, policies, FAQs).
Without caching: $0.00075 per message × 15M messages = $11,250/month just for system instruction.

## Decision
Enable explicit context caching with 60-min TTL.
Pre-warm cache every 55 minutes to avoid cold starts.

## Consequences
### Positive
- Cache cost: $1,125/month (vs $11,250 without)
- Savings: $10,125/month (-90%)
- No latency impact

### Negative
- Cache management complexity
- Risk of stale cache if system instruction changes

### Mitigation
- Cache invalidation on system instruction update
- Monitoring cache hit rate (target >95%)

---

# ADR-003: DLP for PII Protection

## Status
ACCEPTED

## Context
GDPR requires protecting customer PII.
Risk: Customers may share SSN, credit cards in chat.

## Decision
Implement Cloud DLP to scan all user messages before sending to Gemini.
Redact: SSN, credit cards, emails, phone numbers.

## Consequences
### Positive
- GDPR compliance
- Protect customer privacy
- Prevent PII leakage to LLM

### Negative
- Added latency: +50-100ms per message
- Cost: $0.000015 per message = $450/month

### Mitigation
- Async DLP (non-blocking for non-PII messages)
- Cache DLP results for repeated messages

5. Security Checklist (20 min)

Security Control	Implementation	Status
VPC Service Controls	Perimeter around Vertex AI, block data exfiltration	✅ Required
DLP PII Scanning	Scan all prompts, redact SSN/CC/email	✅ Required
IAM Least Privilege	Service accounts with minimal roles	✅ Required
Audit Logging	Data Access logs for all Vertex AI calls	✅ Required
Safety Settings	BLOCK_LOW_AND_ABOVE for all categories	✅ Required
CMEK	Customer-managed encryption keys	⚠️ Optional (highly recommended)
Private Service Connect	Vertex AI access via private endpoint	⚠️ Optional (if ultra-secure network)
Secrets Management	API keys in Secret Manager	✅ Required

6. Monitoring Dashboard (20 min)

Definir metriques et alertes :

yaml

# monitoring.yaml

dashboards:
  overview:
    metrics:
      - Total conversations (24h)
      - Active conversations (realtime)
      - Avg response time
      - Cost today vs budget
      - Error rate

  performance:
    metrics:
      - Latency p50/p95/p99 by model
      - Cache hit rate
      - RAG retrieval quality (precision@5)
      - Agent tool call success rate

  cost:
    metrics:
      - Cost by model (pie chart)
      - Daily cost trend (30 days)
      - Cost per conversation
      - Budget utilization (%)

  quality:
    metrics:
      - CSAT score
      - Resolution rate
      - Escalation rate
      - Safety filter blocks

alerts:
  - name: Budget Alert
    condition: daily_cost > $400
    channels: [email, slack]
    severity: warning

  - name: Error Rate High
    condition: error_rate > 2%
    channels: [pagerduty]
    severity: critical

  - name: Latency Degradation
    condition: p95_latency > 3000ms
    channels: [slack]
    severity: warning

  - name: Cache Hit Rate Low
    condition: cache_hit_rate < 90%
    channels: [email]
    severity: info

  - name: Safety Filter Spike
    condition: safety_blocks > 100/hour
    channels: [email, slack]
    severity: warning

📊 Criteres Evaluation

Critere	Points	Description
Architecture	25	Completude, coherence, scalabilite
Cost Optimization	20	Model routing, caching, cost model realiste
Security	20	VPC-SC, DLP, IAM, compliance
Implementation Plan	15	Realisme, timeline, dependencies
Monitoring	10	Metriques pertinentes, alertes actionable
Documentation	10	ADRs, diagrammes, clarte

Total : 100 points

Seuil validation : 70/100

Ce projet final synthetise tout Phase 4. Architecture solide = fondation succes production. Prenez temps pour bien concevoir avant implementer. Validez assumptions avec calculs couts. Documentez decisions (ADRs). En entreprise, ce type design doc = prerequisite avant dev sprint. Qualite architecture determine succes long-terme projet IA.

Examen Final & Certification

⏱ 60 min Certification

🎯 Objectif

Valider maitrise complete Phase 4 : Deploiement Enterprise, FinOps, Gouvernance.

Format : 30 questions QCM + validation projet final

Duree : 60 minutes

Seuil reussite : 24/30 (80%)

📝 Examen Final : 30 Questions

1. Quelle difference principale entre AI Studio et Vertex AI ?

Prix differents

Modeles differents

Vertex AI offre VPC-SC, IAM enterprise, SLA

Aucune difference

2. Pour deployer API Gemini serverless, quelle solution ?

Compute Engine

Cloud Run

GKE Autopilot

App Engine

3. Comment eliminer cold starts Cloud Run ?

min-instances > 0

max-instances > 100

Augmenter memoire

Impossible

4. VPC Service Controls permet de :

Reduire couts

Accelerer inference

Empecher exfiltration donnees

Activer caching

5. Ou stocker API keys securise ?

Code source Git

Secret Manager

Fichier .env

Variables Cloud Run

6. DLP (Data Loss Prevention) sert a :

Detecter et masquer PII avant Gemini

Sauvegarder donnees

Monitorer couts

Accelerer requetes

7. Tiered pricing Gemini signifie :

Prix augmente avec volume

Prix varie par region

Remise entreprise

Prix divise par 2 au-dela 200K tokens

8. Context caching reduit cout de combien ?

-50%

-75%

-90% sur partie cachee

-99%

9. Model routing intelligent peut economiser :

60-70%

20-30%

10-15%

5-10%

10. Batch API offre reduction cout de :

-30%

-50%

-70%

-90%

11. Context caching rentable des combien requetes ?

2 requetes (ROI immediate)

10 requetes

100 requetes

1000 requetes

12. Flash-8B coute combien vs Pro ?

2x moins cher

10x moins cher

75x moins cher

Meme prix

13. BigQuery billing export est :

Payant ($100/mois)

Gratuit et essentiel FinOps

Optionnel

Uniquement Enterprise

14. Budget alerts recommandes a :

100%

80% et 100%

50%, 90%, 100%

Pas necessaire

15. Output tokens coutent combien vs input (Flash) ?

Meme prix

4x plus cher

2x plus cher

10x plus cher

16. Les 7 principes Google AI incluent :

Be socially beneficial, Avoid bias, Safety, Privacy

Profit, Growth, Innovation

Speed, Cost, Quality

Open source, Free, Fast

17. Safety settings BLOCK_LOW_AND_ABOVE signifie :

Bloquer seulement haute probabilite

Pas de blocage

Bloquer low, medium, high (strict)

Bloquer uniquement low

18. Gemma Scope permet :

Accelerer inference

Interpretabilite modele via SAE

Reduire couts

Fine-tuning automatique

19. Model lifecycle stages sont :

Dev → Staging → Production → Deprecation

Planning → Build → Deploy

Alpha → Beta → GA

Train → Test → Validate

20. ADR (Architecture Decision Record) documente :

Code source

API endpoints

Decisions architecture avec context/consequences

User stories

21. Tool governance HIGH risk requiert :

Pas d'approbation

Approval manager + audit trail

Auto-approve

Blocage total

22. Agent audit trail doit logger :

Tous appels tools avec params/results

Uniquement erreurs

Rien (privacy)

Uniquement couts

23. Gemma 3 vs Gemini difference principale :

Gemma plus rapide

Gemma moins cher API

Gemma open-source, self-hosted

Aucune difference

24. Gemma Nano use case principal :

Cloud servers

On-device inference (smartphones)

Batch processing

Fine-tuning

25. NotebookLM permet :

Transformer docs en assistant IA interactif

Executer code Python

Heberger modeles

Gerer budgets

26. Astra DB est optimise pour :

Analytics

Transactions

Vector search RAG avec Gemini

Data warehousing

27. Universal Agent vision 2027+ :

Agent par use case

Un agent pour toutes taches digitales

Pas d'agents, seulement APIs

Agents hardware uniquement

28. Generative UI paradigm shift :

UI generee dynamiquement par IA selon besoin

UI statique predefined

Pas d'UI

UI 3D realite virtuelle

29. On-device AI principal avantage :

Plus rapide que cloud

Moins cher que cloud

Privacy totale + offline

Qualite superieure

30. Canary deployment signifie :

Deploy simultane tous serveurs

Deploy 10% traffic, monitor, puis augmenter progressivement

Deploy uniquement nuit

Deploy avec rollback automatique

🎓 Obtenir la Certification

📜 Certification Architecte Gemini

Criteres validation :

✅ Examen final : 24/30 minimum (80%)
✅ Projet final : 70/100 minimum
✅ Toutes lecons Phase 4 completees

Certification delivree :

PDF certificate avec QR code verification
Badge LinkedIn "Architecte Gemini Enterprise"
Acces communaute architectes certifies

🚀 Prochaines Etapes

Felicitations pour avoir complete Phase 4 !

Vous maitrisez maintenant :

✅ Deploiement enterprise production-ready
✅ FinOps & optimisation couts (60-80% economie)
✅ Gouvernance modeles et agents
✅ IA Responsable et compliance
✅ Ecosysteme Google AI complet

Continuer apprentissage :

Implementer projet reel : Appliquez architecture sur use case entreprise
Contribuer open-source : Gemma, Gemma Scope, VertexAI samples
Rejoindre communaute : Google AI Discord, forums GCP
Suivre actualites : Google AI Blog, Vertex AI release notes
Certifications complementaires : Professional Cloud Architect GCP

Resources :

📚 Documentation : cloud.google.com/vertex-ai/docs
💬 Community : discord.gg/google-ai
🎥 Videos : YouTube @GoogleCloudTech
📰 Blog : cloud.google.com/blog/products/ai-machine-learning

Vous etes maintenant Architecte Gemini certifie. Allez builder des applications IA incroyables !