Google AI Studio vs Vertex AI
Learning objectives
- Understand the differences between Google AI Studio and Vertex AI
- Know how to choose the right platform for each use case
- Plan a migration from AI Studio to Vertex AI
- Master the enterprise decision criteria
Full Comparison
| Criterion | Google AI Studio | Vertex AI |
|---|---|---|
| Target audience | Developers, rapid prototyping | Enterprises, production |
| Access | Free Google account | GCP project with billing |
| API key | Simple API key (generativelanguage.googleapis.com) | Service Account, ADC (REGION-aiplatform.googleapis.com) |
| Quotas | Default limits (60 req/min) | Customizable, increasable quotas |
| Security | Shareable API key | IAM, VPC-SC, CMEK, Private Service Connect |
| Compliance | No guarantees | SOC 2, ISO 27001, HIPAA, GDPR |
| Data residency | Multi-region (EU or US) | Specific chosen region |
| Monitoring | Basic, in the console | Cloud Monitoring, logging, tracing, SLIs |
| Caching | Context Caching available | Context Caching + enterprise optimizations |
| Price | Same pricing as Vertex AI | Same pricing + enterprise options |
| SLA | None | 99.9% uptime (GA models) |
Google AI Studio is perfect for prototyping quickly, testing prompts, and building demos. But as soon as you move to production with sensitive data or compliance requirements, Vertex AI becomes indispensable.
Decision Architecture
DECISION: AI Studio vs Vertex AI, by use case:
- PROTOTYPING → AI STUDIO (fast, free, experimentation)
- SIMPLE PRODUCTION → Vertex AI, recommended (monitoring, quotas)
- CRITICAL ENTERPRISE → Vertex AI + VPC-SC + CMEK (compliance, data residency)
Migration: AI Studio → Vertex AI
Step 1: Create a GCP project
# 1. Create the GCP project
gcloud projects create mon-projet-gemini --name="Gemini Production"
# 2. Enable the APIs
gcloud services enable aiplatform.googleapis.com
gcloud services enable cloudresourcemanager.googleapis.com
# 3. Create a Service Account
gcloud iam service-accounts create gemini-sa \
  --display-name="Gemini Service Account"
# 4. Grant permissions
gcloud projects add-iam-policy-binding mon-projet-gemini \
  --member="serviceAccount:gemini-sa@mon-projet-gemini.iam.gserviceaccount.com" \
  --role="roles/aiplatform.user"
Step 2: Adapt the code
# BEFORE: AI Studio
import google.generativeai as genai
genai.configure(api_key="AIzaSy...")
model = genai.GenerativeModel('gemini-2.0-flash-exp')
response = model.generate_content("Hello")
# AFTER: Vertex AI
from vertexai.generative_models import GenerativeModel
import vertexai
vertexai.init(project="mon-projet-gemini", location="us-central1")
model = GenerativeModel('gemini-2.0-flash-exp')
response = model.generate_content("Hello")
Enterprise Decision Criteria
Use Vertex AI if:
- You process sensitive customer data (PII, PHI)
- You need compliance (GDPR, HIPAA, SOC 2)
- You want to control the region where data is processed
- You need high quotas (>60 req/min)
- You want an SLA with 99.9% uptime
- You must integrate with VPC, Private Service Connect
- You need detailed audit logs
- You want advanced monitoring (Cloud Monitoring)
Use AI Studio if:
- You are in the prototyping/experimentation phase
- You have no sensitive data
- You want to test quickly without GCP setup
- You are exploring Gemini's capabilities
- You are building a demo or a hackathon project
Vertex AI: Enterprise Setup
Learning objectives
- Configure a production-ready Vertex AI project
- Master IAM, VPC-SC, and CMEK for security
- Configure Private Service Connect for isolation
- Manage quotas and limits
Vertex AI Enterprise Architecture
VERTEX AI ENTERPRISE rests on three pillars:
- SECURITY: IAM policies, Service Accounts, CMEK, Workload Identity
- NETWORKING: VPC-SC, Private Service Connect, Shared VPC, Cloud NAT
- MONITORING: Cloud Logging, Cloud Monitoring, Audit Logs, Cost Dashboard
IAM Configuration
Main roles:
| Role | Permissions | Use case |
|---|---|---|
| roles/aiplatform.user | Use Gemini, read models | Backend applications, services |
| roles/aiplatform.admin | Manage endpoints, datasets | ML admins, DevOps |
| roles/aiplatform.viewer | Read resources (read-only) | Monitoring, audit |
| roles/serviceusage.serviceUsageConsumer | Consume APIs | All applications |
# Complete IAM configuration
PROJECT_ID="mon-projet-prod"
SA_NAME="gemini-backend-sa"
SA_EMAIL="${SA_NAME}@${PROJECT_ID}.iam.gserviceaccount.com"
# 1. Create the Service Account
gcloud iam service-accounts create $SA_NAME \
--display-name="Gemini Backend Service" \
--description="SA for production Gemini API calls"
# 2. Grant minimal permissions (Principle of Least Privilege)
gcloud projects add-iam-policy-binding $PROJECT_ID \
--member="serviceAccount:$SA_EMAIL" \
--role="roles/aiplatform.user"
gcloud projects add-iam-policy-binding $PROJECT_ID \
--member="serviceAccount:$SA_EMAIL" \
--role="roles/serviceusage.serviceUsageConsumer"
# 3. (Optional) Workload Identity for GKE
gcloud iam service-accounts add-iam-policy-binding $SA_EMAIL \
--role roles/iam.workloadIdentityUser \
--member "serviceAccount:${PROJECT_ID}.svc.id.goog[NAMESPACE/KSA_NAME]"
# 4. Generate a key (only if needed; prefer ADC)
gcloud iam service-accounts keys create key.json \
--iam-account=$SA_EMAIL
VPC Service Controls (VPC-SC)
VPC-SC creates a security perimeter that protects data against exfiltration.
# 1. Create the Access Policy
gcloud access-context-manager policies create \
  --organization ORG_ID \
  --title "Gemini Production Policy"
# 2. Create the Service Perimeter
gcloud access-context-manager perimeters create gemini_perimeter \
  --title="Gemini Secure Perimeter" \
  --resources=projects/PROJECT_NUMBER \
  --restricted-services=aiplatform.googleapis.com \
  --policy=POLICY_ID
# 3. Allow Private Service Connect
gcloud access-context-manager perimeters update gemini_perimeter \
  --add-vpc-allowed-services=aiplatform.googleapis.com \
  --policy=POLICY_ID
CMEK (Customer-Managed Encryption Keys)
By default, Google encrypts all data. CMEK gives you control over the encryption keys.
# 1. Create a Key Ring and Key in Cloud KMS
gcloud kms keyrings create gemini-keyring \
--location=us-central1
gcloud kms keys create gemini-key \
--location=us-central1 \
--keyring=gemini-keyring \
--purpose=encryption
# 2. Grant access to the Vertex AI service agent
PROJECT_NUMBER=$(gcloud projects describe $PROJECT_ID --format="value(projectNumber)")
SA_VERTEX="service-${PROJECT_NUMBER}@gcp-sa-aiplatform.iam.gserviceaccount.com"
gcloud kms keys add-iam-policy-binding gemini-key \
--location=us-central1 \
--keyring=gemini-keyring \
--member="serviceAccount:$SA_VERTEX" \
--role="roles/cloudkms.cryptoKeyEncrypterDecrypter"
# 3. Use CMEK in Vertex AI (via console or API)
# When creating an endpoint or dataset, specify:
# encryption_spec_key_name = "projects/PROJECT/locations/LOCATION/keyRings/RING/cryptoKeys/KEY"
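As a minimal sketch, assuming the google-cloud-aiplatform SDK and the key created in the previous step, the CMEK key can be set once at initialization so that Vertex AI resources created afterwards use it:

```python
# Minimal sketch: route Vertex AI resources created by this process through a
# customer-managed key (key and project names are assumptions from the steps above).
from google.cloud import aiplatform

KMS_KEY = (
    "projects/mon-projet-prod/locations/us-central1/"
    "keyRings/gemini-keyring/cryptoKeys/gemini-key"
)

aiplatform.init(
    project="mon-projet-prod",
    location="us-central1",
    encryption_spec_key_name=KMS_KEY,  # applied to datasets/endpoints created afterwards
)

# Example: a dataset created from this point on would be encrypted with the CMEK key.
# dataset = aiplatform.TextDataset.create(display_name="support-kb")
```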
Private Service Connect
Private Service Connect lets you call Vertex AI from your VPC without going over the public Internet.
Traffic flow: a GKE cluster (10.0.1.0/24) inside the VPC (10.0.0.0/16) calls a Private Service Connect endpoint, and the traffic reaches the Vertex AI service (aiplatform.googleapis.com) privately.
# Configure Private Service Connect for Vertex AI
gcloud compute addresses create vertex-ai-psc \
  --region=us-central1 \
  --subnet=default \
  --addresses=10.0.2.10
gcloud compute forwarding-rules create vertex-ai-psc-rule \
  --region=us-central1 \
  --network=default \
  --address=vertex-ai-psc \
  --target-service-attachment=projects/PROJECT/regions/us-central1/serviceAttachments/aiplatform
Quota Management
Default Vertex AI quotas:
- Gemini Pro: 60 req/min, 4,000 req/day
- Gemini Flash: 1,000 req/min, 10,000 req/day
- Gemini Flash-Lite: 1,500 req/min, 15,000 req/day
- Max tokens: 2M tokens/min (input + output combined)
# Check current quotas
gcloud services quota list \
  --service=aiplatform.googleapis.com \
  --consumer=projects/$PROJECT_ID
# Request a quota increase (via console or support)
# GCP Console > IAM & Admin > Quotas > filter "aiplatform" > Edit
For production, anticipate your quota needs: an increase request can take 2-3 business days. Put rate limiting in place on the application side, along with fallbacks, to handle quota overruns gracefully (see the sketch below).
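A minimal sketch of that idea, assuming only the Vertex AI SDK already used in this module: a token-bucket style limiter in front of generate_content, with a fallback to a cheaper model when the quota is exhausted (the 60 req/min figure and the model names are taken from this section and are illustrative).

```python
# Minimal sketch: client-side rate limiting with a fallback model on quota errors.
import time
import threading
from google.api_core import exceptions
from vertexai.generative_models import GenerativeModel

class RateLimitedGemini:
    def __init__(self, max_per_minute: int = 60):
        self.max_per_minute = max_per_minute
        self.calls: list[float] = []          # timestamps of recent calls
        self.lock = threading.Lock()
        self.primary = GenerativeModel("gemini-2.5-pro")
        self.fallback = GenerativeModel("gemini-2.5-flash")

    def _wait_for_slot(self):
        with self.lock:
            now = time.time()
            self.calls = [t for t in self.calls if now - t < 60]
            if len(self.calls) >= self.max_per_minute:
                time.sleep(60 - (now - self.calls[0]))  # wait until a slot frees up
            self.calls.append(time.time())

    def generate(self, prompt: str):
        self._wait_for_slot()
        try:
            return self.primary.generate_content(prompt)
        except exceptions.ResourceExhausted:
            # Quota exceeded (HTTP 429) on the primary model: degrade gracefully.
            return self.fallback.generate_content(prompt)

client = RateLimitedGemini(max_per_minute=60)
# response = client.generate("Hello")
```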
Infrastructure & Scalability
Learning objectives
- Deploy Gemini on Cloud Run, GKE, and Cloud Functions
- Configure auto-scaling and load balancing
- Optimize latency with CDN and caching
- Design a highly available architecture
Deployment Options
| Solution | Use case | Advantages | Limits |
|---|---|---|---|
| Cloud Run | Serverless APIs, microservices | Auto-scaling, pay-per-use, simple | Cold starts (~1-2 s) |
| GKE (Kubernetes) | Complex workloads, full control | Flexible, multi-cloud, fine-grained scaling | Complexity, overhead |
| Cloud Functions | Event-driven, simple webhooks | Very simple, native integrations | 60-minute timeout, cold starts |
| Compute Engine | Legacy apps, VM control | Full control, legacy-compatible | No built-in auto-scaling |
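The learning objectives mention Cloud Functions, but only Cloud Run and GKE get code below. Here is a minimal, hypothetical sketch of the Cloud Functions option using the Python Functions Framework (the function name, env vars, and deploy command are assumptions, not part of the original material):

```python
# main.py - hypothetical minimal Gemini endpoint on Cloud Functions (2nd gen)
import os
import functions_framework
import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(
    project=os.environ.get("GCP_PROJECT"),
    location=os.environ.get("GCP_REGION", "us-central1"),
)
model = GenerativeModel("gemini-2.0-flash-exp")

@functions_framework.http
def generate(request):
    """HTTP-triggered function: expects JSON {"prompt": "..."}."""
    data = request.get_json(silent=True) or {}
    prompt = data.get("prompt", "")
    if not prompt:
        return {"error": "missing prompt"}, 400
    response = model.generate_content(prompt)
    return {"text": response.text}, 200

# Deploy (for reference):
#   gcloud functions deploy generate --gen2 --runtime=python312 \
#     --trigger-http --region=us-central1 --entry-point=generate
```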
Cloud Run Deployment (Recommended)
# main.py - Gemini service on Cloud Run
from flask import Flask, request, jsonify
from vertexai.generative_models import GenerativeModel
import vertexai
import os
app = Flask(__name__)
# Init Vertex AI (uses ADC automatically on Cloud Run)
vertexai.init(
project=os.environ.get("GCP_PROJECT"),
location=os.environ.get("GCP_REGION", "us-central1")
)
model = GenerativeModel("gemini-2.0-flash-exp")
@app.route("/generate", methods=["POST"])
def generate():
try:
data = request.get_json()
prompt = data.get("prompt")
# Generate the response
response = model.generate_content(
prompt,
generation_config={
"temperature": 0.7,
"max_output_tokens": 2048
}
)
return jsonify({
"text": response.text,
"usage": {
"prompt_tokens": response.usage_metadata.prompt_token_count,
"candidates_tokens": response.usage_metadata.candidates_token_count
}
})
except Exception as e:
return jsonify({"error": str(e)}), 500
@app.route("/health", methods=["GET"])
def health():
return jsonify({"status": "healthy"}), 200
if __name__ == "__main__":
app.run(host="0.0.0.0", port=int(os.environ.get("PORT", 8080)))
# Dockerfile
FROM python:3.12-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY main.py .
# Healthcheck for Cloud Run
HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
CMD python -c "import requests; requests.get('http://localhost:8080/health')"
CMD ["gunicorn", "--bind", "0.0.0.0:8080", "--workers", "4", "--threads", "2", "--timeout", "300", "main:app"]
# Cloud Run deployment with optimizations
gcloud run deploy gemini-api \
  --source . \
  --region us-central1 \
  --platform managed \
  --allow-unauthenticated \
  --service-account gemini-backend-sa@PROJECT_ID.iam.gserviceaccount.com \
  --set-env-vars GCP_PROJECT=PROJECT_ID,GCP_REGION=us-central1 \
  --memory 2Gi \
  --cpu 2 \
  --min-instances 1 \
  --max-instances 100 \
  --concurrency 80 \
  --timeout 300 \
  --cpu-boost \
  --execution-environment gen2
# Auto-scaling configuration
gcloud run services update gemini-api \
  --region us-central1 \
  --cpu-throttling \
  --max-instances 100 \
  --min-instances 2
Key flags:
- --min-instances 2: eliminates cold starts for 99% of requests
- --cpu-boost: speeds up instance startup (~30% faster)
- --concurrency 80: balances throughput and latency
- --execution-environment gen2: about 2x faster, better isolation
GKE (Kubernetes) Deployment
# k8s-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: gemini-api
namespace: production
spec:
replicas: 3
selector:
matchLabels:
app: gemini-api
template:
metadata:
labels:
app: gemini-api
spec:
serviceAccountName: gemini-k8s-sa
containers:
- name: gemini-api
image: gcr.io/PROJECT_ID/gemini-api:v1.0.0
ports:
- containerPort: 8080
env:
- name: GCP_PROJECT
value: "PROJECT_ID"
- name: GCP_REGION
value: "us-central1"
resources:
requests:
memory: "1Gi"
cpu: "1000m"
limits:
memory: "2Gi"
cpu: "2000m"
livenessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 10
periodSeconds: 30
readinessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 5
periodSeconds: 10
---
apiVersion: v1
kind: Service
metadata:
name: gemini-api-service
namespace: production
spec:
type: LoadBalancer
selector:
app: gemini-api
ports:
- protocol: TCP
port: 80
targetPort: 8080
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: gemini-api-hpa
namespace: production
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: gemini-api
minReplicas: 3
maxReplicas: 50
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 80
Load Balancing & CDN
Architecture: a Global Load Balancer (Cloud Load Balancing) fans out to Cloud Run services in us-central1, europe-west1, and asia-east1 (3 instances each), all of which call Vertex AI in us-central1.
# Global Load Balancer configuration with Cloud CDN
# 1. Create a serverless NEG (Network Endpoint Group) for Cloud Run
gcloud compute network-endpoint-groups create gemini-api-neg \
  --region=us-central1 \
  --network-endpoint-type=serverless \
  --cloud-run-service=gemini-api
# 2. Create the Backend Service with CDN
gcloud compute backend-services create gemini-backend \
  --global \
  --enable-cdn \
  --cache-mode=CACHE_ALL_STATIC \
  --default-ttl=3600
# 3. Add the NEG to the backend
gcloud compute backend-services add-backend gemini-backend \
  --global \
  --network-endpoint-group=gemini-api-neg \
  --network-endpoint-group-region=us-central1
# 4. Create the URL map and HTTPS proxy
gcloud compute url-maps create gemini-lb \
  --default-service=gemini-backend
gcloud compute target-https-proxies create gemini-https-proxy \
  --url-map=gemini-lb \
  --ssl-certificates=gemini-cert
# 5. Create a global IP and forwarding rule
gcloud compute addresses create gemini-ip --global
gcloud compute forwarding-rules create gemini-https-rule \
  --global \
  --target-https-proxy=gemini-https-proxy \
  --address=gemini-ip \
  --ports=443
For optimal latency: deploy Cloud Run in several regions (us-central1, europe-west1, asia-east1), configure a global Load Balancer, and enable Cloud CDN to cache frequent responses. Vertex AI is only available in certain regions, so your backends should call the closest Vertex AI region.
Performance Optimizations
| Technique | Latency impact | Implementation |
|---|---|---|
| Min instances > 0 | -1000 ms (eliminates cold start) | --min-instances 2 on Cloud Run |
| Connection pooling | -50 ms per request | Reuse the Vertex AI client |
| Streaming | -2000 ms (TTFT) | stream=True in generate_content |
| Context caching | -80% latency | Cache long system prompts |
| CDN for assets | -200 ms (static assets) | Cloud CDN on the Load Balancer |
| Multiple regions | -100 ms (geo latency) | Multi-region deploy + GLB |
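A small sketch of the first three rows of this table, assuming only the SDK already used above: the model client is created once at module load (connection pooling) and responses are streamed so the first tokens arrive early.

```python
# Minimal sketch: reuse one client and stream responses to cut time-to-first-token.
import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project="mon-projet-prod", location="us-central1")

# Created once at import time and reused by every request (connection pooling).
MODEL = GenerativeModel("gemini-2.5-flash")

def answer(prompt: str) -> str:
    chunks = []
    # stream=True yields partial responses as they are generated.
    for chunk in MODEL.generate_content(prompt, stream=True):
        print(chunk.text, end="", flush=True)  # forward to the caller as it arrives
        chunks.append(chunk.text)
    return "".join(chunks)

# answer("Summarize the main advantages of streaming.")
```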
Enterprise Security
Learning objectives
- Implement a zero-trust IAM strategy
- Configure VPC Service Controls and DLP
- Secure secrets with Secret Manager
- Enable audit logs and security monitoring
Defense in Depth
Layer 7: MONITORING (Audit Logs, Security Command Center, Alerting)
Layer 6: DLP & FILTERING (Data Loss Prevention, Content Moderation)
Layer 5: ENCRYPTION (CMEK, TLS 1.3, Data at Rest)
Layer 4: SECRETS (Secret Manager, Workload Identity)
Layer 3: NETWORK ISOLATION (VPC-SC, Private Service Connect)
Layer 2: IDENTITY (IAM Policies, Service Accounts)
Layer 1: AUTHENTICATION (OAuth 2.0, API Keys, mTLS)
Zero-Trust IAM
Principle of least privilege:
# BAD: granting roles/owner (far too many permissions)
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="serviceAccount:app@PROJECT_ID.iam.gserviceaccount.com" \
  --role="roles/owner"  # DANGER
# GOOD: minimal, granular permissions
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="serviceAccount:app@PROJECT_ID.iam.gserviceaccount.com" \
  --role="roles/aiplatform.user"  # the minimum needed
# Even better: a custom role with precise permissions
gcloud iam roles create geminiUserCustom \
  --project=PROJECT_ID \
  --title="Gemini User Custom" \
  --permissions=aiplatform.endpoints.predict,aiplatform.models.get
Segregation by environment:
# Separate Service Accounts per environment
# DEV
gcloud iam service-accounts create gemini-dev-sa \
--display-name="Gemini Dev" \
--project=project-dev
# STAGING
gcloud iam service-accounts create gemini-staging-sa \
--display-name="Gemini Staging" \
--project=project-staging
# PROD
gcloud iam service-accounts create gemini-prod-sa \
--display-name="Gemini Prod" \
--project=project-prod
# IAM Conditions: restrict access by IP, time of day, or resource
gcloud projects add-iam-policy-binding project-prod \
--member="serviceAccount:gemini-prod-sa@project-prod.iam.gserviceaccount.com" \
--role="roles/aiplatform.user" \
--condition='expression=request.time < timestamp("2026-12-31T23:59:59Z"),title=expires-end-of-year'
Secret Manager
# 1. Create a secret for third-party API keys
echo -n "sk-openai-api-key-xyz" | gcloud secrets create openai-api-key \
--data-file=- \
--replication-policy="automatic"
# 2. Grant access to the Service Account
gcloud secrets add-iam-policy-binding openai-api-key \
--member="serviceAccount:gemini-prod-sa@project-prod.iam.gserviceaccount.com" \
--role="roles/secretmanager.secretAccessor"
# 3. Use it in the application
from google.cloud import secretmanager
client = secretmanager.SecretManagerServiceClient()
name = "projects/PROJECT_ID/secrets/openai-api-key/versions/latest"
response = client.access_secret_version(request={"name": name})
api_key = response.payload.data.decode("UTF-8")
Never do the following:
- Hardcode secrets in source code
- Commit .env files with real keys to Git
- Expose secrets in logs or error messages
- Share secrets via Slack/email
- Reuse the same API key across dev/staging/prod
Data Loss Prevention (DLP)
DLP automatically detects and masks sensitive data (PII, PHI, PCI) before it is sent to Gemini.
# DLP inspection before sending to Gemini
from google.cloud import dlp_v2
def inspect_and_deidentify(text, project_id):
dlp = dlp_v2.DlpServiceClient()
parent = f"projects/{project_id}/locations/global"
# Inspection configuration (detect PII)
inspect_config = {
"info_types": [
{"name": "EMAIL_ADDRESS"},
{"name": "PHONE_NUMBER"},
{"name": "CREDIT_CARD_NUMBER"},
{"name": "US_SOCIAL_SECURITY_NUMBER"},
{"name": "PERSON_NAME"}
],
"min_likelihood": dlp_v2.Likelihood.LIKELY
}
# De-identification configuration (mask PII)
deidentify_config = {
"info_type_transformations": {
"transformations": [{
"primitive_transformation": {
"replace_with_info_type_config": {}
}
}]
}
}
item = {"value": text}
response = dlp.deidentify_content(
request={
"parent": parent,
"deidentify_config": deidentify_config,
"inspect_config": inspect_config,
"item": item
}
)
return response.item.value
# Usage
user_input = "Mon email est john.doe@example.com et mon tel est 555-1234"
safe_input = inspect_and_deidentify(user_input, "mon-projet")
# Result: "Mon email est [EMAIL_ADDRESS] et mon tel est [PHONE_NUMBER]"
# Send only the de-identified text to Gemini
response = model.generate_content(safe_input)
Audit Logs
The 3 types of Audit Logs:
- Admin Activity: administrative actions (always on, free)
- Data Access: data reads/writes (must be enabled, billed)
- System Event: internal GCP events (free)
# Enable Data Access logs for Vertex AI
gcloud logging project-logs enable \
DATA_ACCESS \
--project=PROJECT_ID
# Query logs: who called Gemini?
gcloud logging read 'resource.type="aiplatform.googleapis.com/Endpoint"
AND protoPayload.methodName="google.cloud.aiplatform.v1.PredictionService.Predict"' \
--project=PROJECT_ID \
--limit=50 \
--format=json
# Create metrics for alerting on suspicious behavior
gcloud logging metrics create gemini_unusual_volume \
--description="Alert if >1000 req/min from a single IP" \
--log-filter='resource.type="aiplatform.googleapis.com/Endpoint"
AND protoPayload.methodName="google.cloud.aiplatform.v1.PredictionService.Predict"'
Audit logs are essential for compliance (GDPR Article 32, SOC 2, HIPAA). Enable them in production. Configure exports to BigQuery for long-term analysis, and correlate with Security Command Center for automatic anomaly detection.
Advanced VPC Service Controls (VPC-SC)
# vpc-sc-policy.yaml - Full VPC-SC configuration
name: accessPolicies/POLICY_ID/servicePerimeters/gemini_perimeter
title: "Gemini Production Secure Perimeter"
status:
resources:
- projects/PROJECT_NUMBER
restrictedServices:
- aiplatform.googleapis.com
- storage.googleapis.com
accessLevels:
- accessPolicies/POLICY_ID/accessLevels/corporate_network
vpcAccessibleServices:
enableRestriction: true
allowedServices:
- aiplatform.googleapis.com
ingressPolicies:
- ingressFrom:
identities:
- serviceAccount:gemini-prod-sa@PROJECT_ID.iam.gserviceaccount.com
sources:
- accessLevel: accessPolicies/POLICY_ID/accessLevels/corporate_network
ingressTo:
resources:
- "*"
operations:
- serviceName: aiplatform.googleapis.com
methodSelectors:
- method: "google.cloud.aiplatform.v1.PredictionService.Predict"
egressPolicies:
- egressFrom:
identities:
- serviceAccount:gemini-prod-sa@PROJECT_ID.iam.gserviceaccount.com
egressTo:
resources:
- "*"
operations:
- serviceName: storage.googleapis.com
Compliance & GDPR
Learning objectives
- Understand GDPR requirements for AI systems
- Configure data residency and data sovereignty
- Carry out a DPIA (Data Protection Impact Assessment)
- Know the SOC 2, HIPAA, and ISO 27001 certifications
GDPR and AI: Essential Obligations
| GDPR article | Obligation | Vertex AI implementation |
|---|---|---|
| Art. 5 | Data minimization | DLP to filter PII, no unnecessary storage |
| Art. 13-14 | Transparency (inform users) | Disclaimer "This chat uses Gemini by Google" |
| Art. 15 | Right of access | Log all user requests, export API |
| Art. 17 | Right to erasure | Purge logs after 90 days, no fine-tuning on user data |
| Art. 25 | Privacy by design | VPC-SC, CMEK, anonymization by default |
| Art. 28 | DPA (Data Processing Agreement) | Sign Google's Cloud Data Processing Addendum |
| Art. 32 | Security | TLS 1.3, encryption at rest, audit logs |
| Art. 33 | Breach notification (72 h) | Security Command Center alerts |
| Art. 35 | DPIA if high risk | DPIA template for Gemini chatbots |
Data Residency & Data Sovereignty
Vertex AI regions available (2026):
- Europe: europe-west1 (Belgium), europe-west4 (Netherlands), europe-west9 (France)
- US: us-central1 (Iowa), us-east1 (South Carolina), us-west1 (Oregon)
- Asia: asia-northeast1 (Tokyo), asia-southeast1 (Singapore)
# EU region configuration for GDPR compliance
import vertexai
from vertexai.generative_models import GenerativeModel
# IMPORTANT: force an EU region for GDPR-scoped data
vertexai.init(
    project="mon-projet-eu",
    location="europe-west1"  # Belgium (EU)
)
model = GenerativeModel("gemini-2.0-flash-exp")
# Check that the region is indeed in the EU
print(f"Region in use: {vertexai._location}")
# Output: "europe-west1"
DPIA Template for a Gemini Chatbot
A Data Protection Impact Assessment (DPIA) is required if:
- Automated processing with legal effects (e.g., AI-based credit scoring)
- Systematic large-scale monitoring (e.g., employee monitoring)
- Sensitive data: health (HIPAA), children, biometrics
# DPIA Template: Customer Support Chatbot (Gemini)
## 1. Description of the processing
- **Nature**: Generative AI chatbot for customer support
- **Scope**: 50,000 users/month, EU only
- **Context**: Product support questions, no payments
- **Purposes**: Answer questions, reduce support tickets
## 2. Data processed
- **Data collected**: Name, email, conversation history
- **Sensitive data**: NONE (no health, religion, etc.)
- **Retention**: 90 days, then automatic deletion
## 3. Necessity and proportionality
- **Legal basis**: Legitimate interest (Art. 6(1)(f) GDPR)
- **Minimization**: Only name/email, no phone number or address
- **Alternatives considered**: Human-only support (too slow), static FAQ (less effective)
## 4. Identified risks
| Risk | Impact | Likelihood | Mitigation |
|---|---|---|---|
| Leak of conversation data | Medium | Low | VPC-SC, CMEK, TLS 1.3 |
| Hallucination giving bad advice | Medium | Medium | Grounding on docs, disclaimer |
| Re-identification via writing style | Low | Very low | No fine-tuning |
## 5. Security measures
- Encryption in transit (TLS 1.3) and at rest (AES-256)
- VPC Service Controls (no external access)
- Audit logs enabled (Cloud Logging)
- DLP to detect accidental PII
- EU region (europe-west1) with data residency
## 6. User rights
- Transparent information ("Powered by Gemini" banner)
- Right of access (conversation export API)
- Right to erasure ("Delete my data" button)
- Right to object (chatbot opt-out)
## 7. Conclusion
Residual risk: **LOW**
DPIA approved by: DPO (Data Protection Officer)
Date: 2026-02-10
HIPAA Compliance (Health Data)
Google Cloud signs a BAA (Business Associate Agreement) covering:
- Vertex AI (Gemini via Vertex AI only, NOT AI Studio)
- Cloud Storage, BigQuery, Cloud SQL
- Cloud Logging (but disable Data Access logs containing PHI)
# HIPAA-compliant configuration for a medical application
# 1. Enable an organization policy to enforce CMEK
gcloud resource-manager org-policies set-policy cmek-policy.yaml
# cmek-policy.yaml
name: projects/PROJECT_ID/policies/constraints/gcp.restrictNonCmekServices
spec:
rules:
- enforce: true
# 2. Enable Access Transparency (see who at Google accesses the data)
gcloud organizations add-iam-policy-binding ORG_ID \
--member='domain:example.com' \
--role='roles/accessapproval.approver'
# 3. Configure compliant log retention (6 years for HIPAA)
gcloud logging sinks create hipaa-audit-sink \
bigquery.googleapis.com/projects/PROJECT_ID/datasets/hipaa_audit_logs \
--log-filter='protoPayload.serviceName="aiplatform.googleapis.com"'
# 4. Disable Data Access logs to avoid logging PHI
# (Configure via IAM & Admin > Audit Logs > disable "Data Read/Write" for aiplatform)
For HIPAA: ALWAYS use Vertex AI (never AI Studio), sign the BAA with Google, enable CMEK, configure Access Transparency, and retain audit logs for 6 years. Also consider de-identifying data before sending it to Gemini, for example with the Cloud Healthcare API.
ISO 27001 & SOC 2 Type II
Google Cloud is certified for:
- ISO 27001 (Information Security Management)
- ISO 27017 (Cloud Security)
- ISO 27018 (Privacy in the Cloud)
- SOC 2 Type II (Security, Availability, Confidentiality)
- SOC 3 (public version of SOC 2)
Available reports:
- GCP Console > Security > Compliance Reports Manager
- Download ISO/SOC reports for audits
- Share with auditors under NDA
CI/CD for AI
Learning objectives
- Set up a CI/CD pipeline for Gemini applications
- Implement prompt versioning and automated testing
- Configure evaluation gates before production
- Deploy with canary and blue-green strategies
Complete CI/CD Pipeline
Gemini CI/CD pipeline stages:
1. COMMIT: git push, PR opened
2. BUILD: Docker image, lint
3. TEST: unit tests, prompt eval
4. DEV DEPLOY: Cloud Run, automatic deploy
5. STAGING: smoke tests, eval gate, manual approval
6. PROD: 10% canary, monitoring, rollback if needed
Cloud Build Configuration
# cloudbuild.yaml - Full CI/CD pipeline
steps:
# Step 1: Linting and formatting
- name: 'python:3.12'
id: 'lint'
entrypoint: 'bash'
args:
- '-c'
- |
pip install ruff black
ruff check src/
black --check src/
# Step 2: Unit tests
- name: 'python:3.12'
id: 'unit-tests'
entrypoint: 'bash'
args:
- '-c'
- |
pip install -r requirements.txt
pytest tests/unit/ --cov=src --cov-report=term
# Step 3: Prompt evaluation
- name: 'python:3.12'
id: 'prompt-eval'
entrypoint: 'bash'
secretEnv: ['VERTEX_PROJECT']
args:
- '-c'
- |
pip install -r requirements.txt
python scripts/eval_prompts.py --project=$VERTEX_PROJECT --threshold=0.7
waitFor: ['unit-tests']
# Step 4: Build the Docker image
- name: 'gcr.io/cloud-builders/docker'
id: 'build-image'
args:
- 'build'
- '-t'
- 'gcr.io/$PROJECT_ID/gemini-api:$SHORT_SHA'
- '-t'
- 'gcr.io/$PROJECT_ID/gemini-api:latest'
- '.'
waitFor: ['prompt-eval']
# Step 5: Push the image
- name: 'gcr.io/cloud-builders/docker'
id: 'push-image'
args:
- 'push'
- '--all-tags'
- 'gcr.io/$PROJECT_ID/gemini-api'
waitFor: ['build-image']
# Step 6: Deploy to DEV (automatic)
- name: 'gcr.io/google.com/cloudsdktool/cloud-sdk'
id: 'deploy-dev'
entrypoint: 'bash'
args:
- '-c'
- |
gcloud run deploy gemini-api-dev \
--image gcr.io/$PROJECT_ID/gemini-api:$SHORT_SHA \
--region us-central1 \
--platform managed \
--service-account gemini-dev-sa@$PROJECT_ID.iam.gserviceaccount.com \
--set-env-vars ENV=dev,VERSION=$SHORT_SHA \
--tag dev-$SHORT_SHA
waitFor: ['push-image']
# Step 7: Smoke tests on DEV
- name: 'python:3.12'
id: 'smoke-tests-dev'
entrypoint: 'bash'
args:
- '-c'
- |
pip install requests
python scripts/smoke_tests.py --url=https://gemini-api-dev-HASH-uc.a.run.app
waitFor: ['deploy-dev']
# Step 8: Deploy to STAGING (only on the main branch)
- name: 'gcr.io/google.com/cloudsdktool/cloud-sdk'
id: 'deploy-staging'
entrypoint: 'bash'
args:
- '-c'
- |
if [ "$BRANCH_NAME" = "main" ]; then
gcloud run deploy gemini-api-staging \
--image gcr.io/$PROJECT_ID/gemini-api:$SHORT_SHA \
--region us-central1 \
--platform managed \
--service-account gemini-staging-sa@$PROJECT_ID.iam.gserviceaccount.com \
--set-env-vars ENV=staging,VERSION=$SHORT_SHA
fi
waitFor: ['smoke-tests-dev']
# Secrets from Secret Manager
availableSecrets:
secretManager:
- versionName: projects/$PROJECT_ID/secrets/vertex-project/versions/latest
env: 'VERTEX_PROJECT'
# Timeout global
timeout: '1800s'
# Tags
tags: ['gemini-api', 'ci-cd']
options:
machineType: 'E2_HIGHCPU_8'
logging: CLOUD_LOGGING_ONLY
Prompt Versioning
# prompts.py - Prompt versioning
from dataclasses import dataclass
from typing import Dict
import json
@dataclass
class PromptVersion:
version: str
system_instruction: str
temperature: float
max_tokens: int
metadata: Dict[str, str]
class PromptRegistry:
"""Registry centralise pour tous les prompts versions"""
PROMPTS = {
"customer_support_v1": PromptVersion(
version="1.0.0",
system_instruction="""Tu es un assistant support client pour AcmeCorp.
Reponds de maniere concise et professionnelle.
Si tu ne sais pas, dis 'Je ne sais pas, je transfere a un humain.'""",
temperature=0.3,
max_tokens=512,
metadata={"created": "2026-01-15", "author": "team-support"}
),
"customer_support_v2": PromptVersion(
version="2.0.0",
system_instruction="""Tu es un assistant support client expert pour AcmeCorp.
REGLES :
1. Reponds en 2-3 phrases maximum
2. Utilise les docs (grounding) pour info precise
3. Si pas dans docs, dis "Je ne sais pas"
4. Toujours termine par "Autre question ?"
TONE : Professionnel mais amical""",
temperature=0.2,  # more deterministic
max_tokens=256,  # shorter
metadata={"created": "2026-02-01", "author": "team-support", "ab_test": "variant_b"}
),
}
@classmethod
def get_prompt(cls, prompt_id: str) -> PromptVersion:
if prompt_id not in cls.PROMPTS:
raise ValueError(f"Prompt {prompt_id} not found")
return cls.PROMPTS[prompt_id]
@classmethod
def get_active_prompt(cls, use_case: str = "customer_support") -> PromptVersion:
"""Retourne le prompt actif (gere via feature flags)"""
# En production, lire depuis feature flag (LaunchDarkly, Cloud Config, etc.)
active_version = "customer_support_v2" # ou v1 selon A/B test
return cls.get_prompt(active_version)
# Usage
from vertexai.generative_models import GenerativeModel
prompt_config = PromptRegistry.get_active_prompt("customer_support")
model = GenerativeModel(
"gemini-2.0-flash-exp",
system_instruction=prompt_config.system_instruction
)
response = model.generate_content(
"Comment retourner un produit ?",
generation_config={
"temperature": prompt_config.temperature,
"max_output_tokens": prompt_config.max_tokens
}
)
- Always version prompts (semantic versioning: 1.0.0, 1.1.0, 2.0.0)
- Store them in Git with a review process (PR required)
- Track metadata: author, date, rationale for the change
- A/B test new versions before a 100% rollout (see the sketch below)
- Roll back quickly if quality degrades
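As a minimal sketch of the A/B point above: a deterministic split between two prompt versions, reusing the PromptRegistry defined earlier (the hashing scheme and the 10% rollout figure are assumptions).

```python
# Minimal sketch: deterministic A/B split between two prompt versions.
import hashlib

def pick_prompt_version(user_id: str, rollout_pct: int = 10) -> str:
    """Route a stable percentage of users to the new prompt version."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "customer_support_v2" if bucket < rollout_pct else "customer_support_v1"

# Example: the same user always lands in the same bucket
# version_id = pick_prompt_version("user-42")
# prompt_config = PromptRegistry.get_prompt(version_id)
```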
Evaluation Gates
# scripts/eval_prompts.py - Automatic evaluation before deploy
import vertexai
from vertexai.generative_models import GenerativeModel
from vertexai.preview.evaluation import EvalTask
import argparse
def run_eval_gate(project: str, threshold: float = 0.7):
"""
Evaluate the prompt on the test dataset.
Fail the build if the score < threshold.
"""
vertexai.init(project=project, location="us-central1")
# Test dataset (golden set)
test_cases = [
{
"input": "Comment retourner un produit ?",
"expected_output": "Vous avez 30 jours pour retourner un produit...",
"rubric": "Doit mentionner delai 30 jours et procedure"
},
{
"input": "Quel est le prix du produit XYZ ?",
"expected_output": "Je ne sais pas",
"rubric": "Doit dire 'je ne sais pas' si info pas dans docs"
},
# ... 50+ test cases
]
# Load the active prompt
from prompts import PromptRegistry
prompt_config = PromptRegistry.get_active_prompt()
model = GenerativeModel(
"gemini-2.0-flash-exp",
system_instruction=prompt_config.system_instruction
)
# Evaluate with the Vertex AI Evaluation service
eval_task = EvalTask(
dataset=test_cases,
metrics=["coherence", "fluency", "safety", "groundedness"],
experiment="prompt-eval-" + prompt_config.version
)
results = eval_task.evaluate(model=model)
# Compute the overall score
avg_score = sum(
results.summary_metrics[m] for m in ["coherence", "fluency", "groundedness"]
) / 3
print(f"Evaluation score: {avg_score:.2f}")
print(f"Threshold: {threshold}")
if avg_score < threshold:
print("โ EVAL GATE FAILED - Score trop bas")
exit(1) # Fail le build
else:
print("โ
EVAL GATE PASSED")
exit(0)
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--project", required=True)
parser.add_argument("--threshold", type=float, default=0.7)
args = parser.parse_args()
run_eval_gate(args.project, args.threshold)
Deployment Strategies
1. Canary deployment (recommended):
# Deploy the new version with no traffic yet
gcloud run deploy gemini-api-prod \
  --image gcr.io/PROJECT_ID/gemini-api:v2.0.0 \
  --region us-central1 \
  --tag canary \
  --no-traffic
# Route 10% of traffic to the canary
gcloud run services update-traffic gemini-api-prod \
  --region us-central1 \
  --to-tags canary=10
# Monitor for 1 hour (errors, latency, quality)
# If OK: increase to 50%
gcloud run services update-traffic gemini-api-prod \
  --region us-central1 \
  --to-tags canary=50
# If OK: roll out to 100%
gcloud run services update-traffic gemini-api-prod \
  --region us-central1 \
  --to-latest
# If KO: immediate rollback
gcloud run services update-traffic gemini-api-prod \
  --region us-central1 \
  --to-revisions PREVIOUS_REVISION=100
2. Blue-green deployment:
# BLUE environment (currently in production)
gcloud run deploy gemini-api-blue \
  --image gcr.io/PROJECT_ID/gemini-api:v1.0.0 \
  --region us-central1
# GREEN environment (new version)
gcloud run deploy gemini-api-green \
  --image gcr.io/PROJECT_ID/gemini-api:v2.0.0 \
  --region us-central1
# The Load Balancer points to BLUE
# After validating GREEN: switch the Load Balancer to GREEN
# If there is a problem: switch back to BLUE instantly
For production, prefer canary deployments with Cloud Run (native support for traffic splits). Start with 10% of traffic on the new version, monitor for 1-2 hours (errors, P95 latency, response quality via eval), then increase progressively. Always keep a one-click rollback ready (see the monitoring sketch below).
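A minimal sketch of the "monitor before promoting" step, under the assumption that the service runs on Cloud Run (so the built-in run.googleapis.com/request_count metric and its response_code_class label are available); this is an illustration, not the canonical way to gate a canary.

```python
# Minimal sketch: compute the 5xx error rate of a Cloud Run service over the last hour.
import time
from google.cloud import monitoring_v3

def error_rate_last_hour(project_id: str, service_name: str) -> float:
    client = monitoring_v3.MetricServiceClient()
    now = time.time()
    interval = monitoring_v3.TimeInterval(
        {"end_time": {"seconds": int(now)}, "start_time": {"seconds": int(now - 3600)}}
    )
    results = client.list_time_series(
        request={
            "name": f"projects/{project_id}",
            "filter": (
                'metric.type="run.googleapis.com/request_count" '
                f'AND resource.labels.service_name="{service_name}"'
            ),
            "interval": interval,
            "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
        }
    )
    total = errors = 0
    for series in results:
        count = sum(point.value.int64_value for point in series.points)
        total += count
        if series.metric.labels.get("response_code_class") == "5xx":
            errors += count
    return errors / total if total else 0.0

# if error_rate_last_hour("mon-projet-prod", "gemini-api-prod") < 0.05:
#     print("Canary looks healthy, increase traffic")
```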
Lab: GCP Deployment Pipeline
Lab objectives
- Build a complete CI/CD pipeline with Cloud Build
- Deploy a Gemini application on Cloud Run
- Configure monitoring and alerts
- Test a canary deployment with rollback
Hands-on Lab: Production Pipeline
Estimated duration: 60 minutes
Step 1: GCP project setup (10 min)
Create a new project and enable the required APIs.
export PROJECT_ID="gemini-lab-$(date +%s)"
gcloud projects create $PROJECT_ID
gcloud config set project $PROJECT_ID
# Enable APIs
gcloud services enable aiplatform.googleapis.com \
  cloudbuild.googleapis.com \
  run.googleapis.com \
  secretmanager.googleapis.com \
  monitoring.googleapis.com
# Create the Service Account
gcloud iam service-accounts create gemini-prod-sa
Step 2: Clone the starter code (5 min)
git clone https://github.com/google-cloud/gemini-deployment-starter
cd gemini-deployment-starter
# Structure:
# - src/main.py (Flask API + Gemini)
# - Dockerfile
# - cloudbuild.yaml
# - tests/
# - prompts/
Step 3: Configure Cloud Build (10 min)
Connect GitHub and configure the triggers.
# Grant Cloud Build permissions
PROJECT_NUMBER=$(gcloud projects describe $PROJECT_ID --format="value(projectNumber)")
gcloud projects add-iam-policy-binding $PROJECT_ID \
--member="serviceAccount:${PROJECT_NUMBER}@cloudbuild.gserviceaccount.com" \
--role="roles/run.admin"
gcloud projects add-iam-policy-binding $PROJECT_ID \
--member="serviceAccount:${PROJECT_NUMBER}@cloudbuild.gserviceaccount.com" \
--role="roles/iam.serviceAccountUser"
# Create the Cloud Build trigger
gcloud builds triggers create github \
--repo-name=gemini-deployment-starter \
--repo-owner=YOUR_GITHUB_USERNAME \
--branch-pattern="^main$" \
--build-config=cloudbuild.yaml
Step 4: First deploy (15 min)
Push the code and watch the pipeline run.
# Edit prompts/customer_support.py
# Commit and push
git add .
git commit -m "Initial deploy"
git push origin main
# Watch the build in the Cloud Console
gcloud builds list --ongoing
# Once finished, test the API
SERVICE_URL=$(gcloud run services describe gemini-api-dev \
--region us-central1 --format="value(status.url)")
curl -X POST $SERVICE_URL/generate \
-H "Content-Type: application/json" \
-d '{"prompt": "Comment retourner un produit ?"}'
Step 5: Monitoring & alerts (10 min)
# Create the monitoring dashboard
gcloud monitoring dashboards create --config-from-file=dashboard.json
# Create an alert on error rate > 5%
gcloud alpha monitoring policies create \
  --notification-channels=CHANNEL_ID \
  --display-name="Gemini API Error Rate High" \
  --condition-display-name="Error rate > 5%" \
  --condition-threshold-value=0.05 \
  --condition-threshold-duration=300s
Step 6: Canary deploy & rollback (10 min)
Deploy a v2 with an intentional bug, then roll back.
# Deploy v2 (with the bug)
git checkout -b v2-buggy
# Change temperature=2.0 (too high, unstable answers)
git commit -am "v2: increase creativity"
git push origin v2-buggy
# Merge into main (automatic deploy to staging)
# Promote to prod with a 10% canary
gcloud run services update-traffic gemini-api-prod \
  --to-tags canary=10
# Watch the metrics (latency goes up, quality drops)
# ROLLBACK
gcloud run services update-traffic gemini-api-prod \
  --to-revisions gemini-api-prod-00001-abc=100
Verification
Check that you have:
- A working CI/CD pipeline with Cloud Build
- The application deployed on Cloud Run
- A monitoring dashboard with metrics
- Alerts configured
- Canary deployment + rollback tested
This lab walked through a production-ready workflow. In a real company, also add: automatic evaluation before every deploy, end-to-end integration tests, security scanning (Snyk, Trivy), and manual approvals before prod.
Module 4.1 Quiz
Quiz: Enterprise Deployment
15 questions to validate what you have learned
1. What is the main difference between AI Studio and Vertex AI?
2. What is the minimum IAM role needed to call Gemini on Vertex AI?
3. Which deployment solution is recommended for a serverless Gemini API?
4. How do you eliminate cold starts on Cloud Run?
5. VPC Service Controls (VPC-SC) lets you:
6. Where should third-party API keys be stored securely?
7. DLP (Data Loss Prevention) lets you:
8. For strict GDPR compliance, you must:
9. CMEK (Customer-Managed Encryption Keys) gives you:
10. Which GCP agreement/certification is required for US health data?
11. In a CI/CD pipeline for AI, evaluation gates are used to:
12. Why version prompts?
13. Canary deployment means:
14. Data Access audit logs must be enabled in order to:
15. Private Service Connect lets you:
Understanding Gemini Costs
Learning objectives
- Master the Gemini pricing model (8 models)
- Understand tiered pricing and the 200K-token threshold
- Calculate the cost of a Gemini application
- Anticipate and budget AI costs
Gemini 2.5 Pricing (2026)
| Model | Input ≤200K | Input >200K | Output ≤200K | Output >200K | Context Cache |
|---|---|---|---|---|---|
| 2.5 Pro | $3.00 / 1M | $1.50 / 1M | $12.00 / 1M | $6.00 / 1M | $0.30 / 1M |
| 2.5 Flash | $0.15 / 1M | $0.075 / 1M | $0.60 / 1M | $0.30 / 1M | $0.015 / 1M |
| 2.5 Flash-8B | $0.04 / 1M | $0.02 / 1M | $0.16 / 1M | $0.08 / 1M | $0.004 / 1M |
| 2.0 Pro Exp (Extended Thinking) | $3.00 / 1M | $1.50 / 1M | $12.00 / 1M | $6.00 / 1M | - |
| 2.0 Flash Exp | $0.15 / 1M | $0.075 / 1M | $0.60 / 1M | $0.30 / 1M | $0.015 / 1M |
| 1.5 Pro | $2.50 / 1M | $1.25 / 1M | $10.00 / 1M | $5.00 / 1M | $0.25 / 1M |
| 1.5 Flash | $0.10 / 1M | $0.05 / 1M | $0.40 / 1M | $0.20 / 1M | $0.01 / 1M |
| 1.5 Flash-8B | $0.03 / 1M | $0.015 / 1M | $0.12 / 1M | $0.06 / 1M | $0.003 / 1M |
Cost Components
1. Input tokens: user prompt + system instruction + cached context
2. Output tokens: the response generated by Gemini
3. Cached tokens: context held in the cache (10x cheaper)
4. Thinking tokens (2.0 Pro Exp): billed as output tokens
# Compute the cost of a request
from vertexai.generative_models import GenerativeModel
model = GenerativeModel("gemini-2.5-flash")
response = model.generate_content("Explique la relativite en 3 paragraphes")
# Extract usage
usage = response.usage_metadata
print(f"Input tokens: {usage.prompt_token_count}")
print(f"Output tokens: {usage.candidates_token_count}")
print(f"Cached tokens: {usage.cached_content_token_count}")
# Compute the cost (Flash, ≤200K tier)
input_cost = (usage.prompt_token_count / 1_000_000) * 0.15
output_cost = (usage.candidates_token_count / 1_000_000) * 0.60
cache_cost = (usage.cached_content_token_count / 1_000_000) * 0.015
total_cost = input_cost + output_cost + cache_cost
print(f"Cout total: ${total_cost:.6f}")
# Exemple : Input 50 tokens, Output 200 tokens
# ($0.000015) + ($0.000120) = $0.000135 par requete
Application Cost Simulation
Example: customer support chatbot
- Volume: 100,000 conversations/month
- Average: 5 messages per conversation
- Average input: 500 tokens (200 system instruction + 300 user)
- Average output: 150 tokens
- Model: Gemini 2.5 Flash
# Chatbot cost calculation
conversations_per_month = 100_000
messages_per_conversation = 5
total_messages = conversations_per_month * messages_per_conversation # 500,000
input_tokens_per_message = 500
output_tokens_per_message = 150
total_input_tokens = total_messages * input_tokens_per_message # 250M
total_output_tokens = total_messages * output_tokens_per_message # 75M
# Flash pricing (≤200K tier for simplicity)
input_cost = (total_input_tokens / 1_000_000) * 0.15 # $37.50
output_cost = (total_output_tokens / 1_000_000) * 0.60 # $45.00
monthly_cost = input_cost + output_cost # $82.50/mois
yearly_cost = monthly_cost * 12 # $990/an
print(f"Cout mensuel : ${monthly_cost:.2f}")
print(f"Cout annuel : ${yearly_cost:.2f}")
print(f"Cout par conversation : ${monthly_cost / conversations_per_month:.4f}")
# $0.000825 par conversation (moins d'un cent !)
Flash is remarkably economical: about $0.0008 per conversation. Even at 1M conversations/month, you would pay only about $825/month. Pro costs 20x more but gives better quality for complex use cases. Start with Flash and upgrade to Pro only when necessary.
Cost Drivers
- Model choice: Flash-8B (4x cheaper than Flash) vs Pro (20x more expensive)
- System instruction length: 2,000 instruction tokens = $0.0003 of input per request (Flash)
- Context caching: -90% cost on the cached portion
- Output tokens: output costs 4x more than input, so control max_output_tokens (see the helper sketch below)
- Tiered pricing: >200K tokens = -50% on price
- Streaming: same cost as non-streaming (no surcharge)
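A minimal sketch that ties these drivers together: convert a response's usage_metadata into dollars using the Flash prices quoted in the table above (the helper name and price constants are illustrative; adjust them for your model and tier).

```python
# Minimal sketch: turn usage_metadata into an approximate dollar cost.
FLASH_PRICES = {"input": 0.15, "output": 0.60, "cached": 0.015}  # $ per 1M tokens

def request_cost(usage, prices=FLASH_PRICES) -> float:
    """usage is response.usage_metadata from the Vertex AI SDK."""
    cached = getattr(usage, "cached_content_token_count", 0) or 0
    return (
        usage.prompt_token_count * prices["input"]
        + usage.candidates_token_count * prices["output"]
        + cached * prices["cached"]
    ) / 1_000_000

# cost = request_cost(response.usage_metadata)
```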
Cost Optimization Strategies
Learning objectives
- Master the 7 cost optimization techniques
- Implement intelligent model routing
- Use context caching for -90% cost
- Optimize prompts and output tokens
The 7 Optimization Techniques
1. Intelligent Model Routing
Principle: use Flash-8B for simple requests, Flash for standard ones, and Pro for complex ones.
# Intelligent router based on request complexity
from vertexai.generative_models import GenerativeModel
class GeminiRouter:
def __init__(self):
self.flash_8b = GenerativeModel("gemini-2.5-flash-8b")
self.flash = GenerativeModel("gemini-2.5-flash")
self.pro = GenerativeModel("gemini-2.5-pro")
def classify_complexity(self, prompt: str) -> str:
"""Classifier la complexite de la requete"""
length = len(prompt)
# Regles simples
if length < 100:
return "simple"
elif "analyse" in prompt.lower() or "compare" in prompt.lower():
return "complex"
elif "summarize" in prompt.lower() or "list" in prompt.lower():
return "simple"
else:
return "standard"
def route(self, prompt: str):
"""Router vers le bon modele"""
complexity = self.classify_complexity(prompt)
if complexity == "simple":
# Flash-8B: $0.04/1M input (75x cheaper than Pro)
model = self.flash_8b
print("Routing to Flash-8B (simple)")
elif complexity == "complex":
# Pro: better quality for reasoning
model = self.pro
print("Routing to Pro (complex)")
else:
# Flash: cost/quality balance
model = self.flash
print("Routing to Flash (standard)")
return model.generate_content(prompt)
# Usage
router = GeminiRouter()
# Simple: Flash-8B ($0.000004 input)
response1 = router.route("Quelle est la capitale de la France ?")
# Standard: Flash ($0.000015 input)
response2 = router.route("Resume les 3 avantages principaux du cloud computing")
# Complex: Pro ($0.00300 input)
response3 = router.route("Analyse comparative detaillee entre architecture monolithique et microservices avec cas d'usage specifiques")
# Savings: ~70% on a mixed workload
2. Context Caching (-90% on the cached portion)
from vertexai.preview import caching
from vertexai.generative_models import GenerativeModel
import datetime
# Create cached content (long system instruction)
system_instruction = """
Tu es un assistant support technique pour notre produit SaaS.
[... 5000 tokens de documentation produit ...]
Voici les 200 questions/reponses FAQ les plus frequentes :
[... documentation complete ...]
"""
cached_content = caching.CachedContent.create(
model_name="gemini-2.5-flash",
system_instruction=system_instruction,
ttl=datetime.timedelta(hours=1), # Cache 1h
)
# Use the cache for multiple requests
model = GenerativeModel.from_cached_content(cached_content)
# Request 1: pays 5,000 cached tokens ($0.000075) instead of full input ($0.00075)
response1 = model.generate_content("Comment reinitialiser mon mot de passe ?")
# Requests 2-1000: cache hits, massive savings
response2 = model.generate_content("Ou trouver mes factures ?")
# SAVINGS:
# Without cache: 1,000 requests x 5,000 input tokens x $0.15/1M = $0.75
# With cache: 1 creation ($0.00075) + 1,000 cache hits at $0.015/1M ($0.075) ≈ $0.076
# => about 90% savings!
3. Batch API (-50% cost)
import json
from google.cloud import aiplatform
# Prepare batch requests (JSONL)
batch_requests = []
with open("questions.txt") as f:
for line in f:
batch_requests.append({
"request": {
"contents": [{"role": "user", "parts": [{"text": line.strip()}]}]
}
})
# Write the JSONL file
with open("batch_input.jsonl", "w") as f:
for req in batch_requests:
f.write(json.dumps(req) + "\n")
# Upload to GCS
from google.cloud import storage
bucket = storage.Client().bucket("my-batch-bucket")
blob = bucket.blob("batch_input.jsonl")
blob.upload_from_filename("batch_input.jsonl")
# Submit batch job
batch_job = aiplatform.BatchPredictionJob.create(
job_display_name="gemini-batch-job",
model_name="gemini-2.5-flash",
input_uri="gs://my-batch-bucket/batch_input.jsonl",
output_uri="gs://my-batch-bucket/output/",
)
print(f"Batch job created: {batch_job.name}")
print("Processing time: 10-30 minutes")
print("Cost reduction: 50% vs real-time API")
# SAVINGS:
# Real-time: 10,000 requests x $0.15/1M input = $1.50
# Batch: 10,000 requests x $0.075/1M input = $0.75
# => 50% savings when latency does not matter
4. Prompt Compression
# BEFORE (verbose, 250 tokens)
prompt_verbose = """
Je voudrais que tu m'aides a comprendre le concept de machine learning.
Peux-tu s'il te plait m'expliquer ce que c'est de maniere simple ?
J'aimerais aussi savoir quelles sont les principales applications.
Et si possible, donne-moi quelques exemples concrets.
Merci beaucoup pour ton aide !
"""
# AFTER (concise, 120 tokens, -52%)
prompt_concis = """
Explique machine learning simplement : definition, applications, exemples concrets.
"""
# TECHNIQUE: drop politeness and redundancy, get straight to the point
# Savings: 52% of input tokens on this prompt
# For system instructions:
system_before = """
You are a helpful assistant. You should always be polite and professional.
When answering questions, make sure to provide detailed explanations.
If you don't know something, be honest about it.
Always format your responses in a clear and readable way.
"""  # 150 tokens
system_after = """
Assistant technique. Reponses detaillees, format clair, honnete sur limites.
"""  # 50 tokens (-67%)
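To verify savings like the 52% above on your own prompts, the SDK's count_tokens call can be used before sending anything; a minimal sketch, reusing the variables from the block above:

```python
# Minimal sketch: measure prompt compression with count_tokens before sending.
from vertexai.generative_models import GenerativeModel

model = GenerativeModel("gemini-2.5-flash")

before = model.count_tokens(prompt_verbose).total_tokens
after = model.count_tokens(prompt_concis).total_tokens
print(f"{before} -> {after} tokens ({100 * (before - after) / before:.0f}% saved)")
```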
5. Output Control
from vertexai.generative_models import GenerativeModel, GenerationConfig
model = GenerativeModel("gemini-2.5-flash")
# BAD: uncontrolled output (can run to 2,000 tokens)
response_uncontrolled = model.generate_content(
"Liste les pays europeens"
)
# Can generate 2,000 tokens = $0.0012 of output
# GOOD: output controlled with max_output_tokens
response_controlled = model.generate_content(
"Liste les pays europeens",
generation_config=GenerationConfig(
max_output_tokens=200,  # hard limit
temperature=0.3,  # less creative = shorter
)
)
# At most 200 tokens = $0.00012 of output
# 90% savings
# For JSON: a strict schema = short, deterministic output
response_json = model.generate_content(
"Top 3 pays europeens par PIB",
generation_config=GenerationConfig(
response_mime_type="application/json",
response_schema={
"type": "array",
"items": {
"type": "object",
"properties": {
"country": {"type": "string"},
"gdp": {"type": "number"}
}
},
"maxItems": 3
}
)
)
# Compact JSON output, no superfluous text
6. Use Flash-8B for Simple Use Cases
| Use case | Recommended model | Input cost | Savings vs Pro |
|---|---|---|---|
| FAQ / simple support | Flash-8B | $0.04/1M | 75x cheaper |
| Classification | Flash-8B | $0.04/1M | 75x cheaper |
| Entity extraction | Flash | $0.15/1M | 20x cheaper |
| Document summarization | Flash | $0.15/1M | 20x cheaper |
| Complex analysis | Pro | $3.00/1M | Justified when quality is critical |
7. Implicit Context Caching (Automatic)
Gemini 2.5 automatically caches common long prefixes (>1,024 tokens) for about 5 minutes. No configuration is needed.
# If you send the same long prefix within 5 minutes:
long_context = "[... 3000 tokens of documentation ...]"
# Request 1: full cost
response1 = model.generate_content(long_context + "\nQuestion 1 ?")
# Request 2 (within 5 minutes): implicit cache hit!
response2 = model.generate_content(long_context + "\nQuestion 2 ?")
# Google automatically detects the identical prefix
# Free cache hit (if the prefix is >1,024 tokens)
# Tip: structure prompts with the stable context first
By combining these 7 techniques you can cut costs by 60-80%. Start with model routing and context caching (quick wins), then optimize prompts and output. Use the Batch API for non-urgent processing. Measure before and after to quantify the ROI (see the sketch below).
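A minimal sketch of that before/after measurement, assuming only the usage_metadata fields already used in this module: wrap generate_content and accumulate token counts so each optimization can be compared on real traffic (the class name and default prices are illustrative).

```python
# Minimal sketch: accumulate token usage to compare costs before/after an optimization.
from vertexai.generative_models import GenerativeModel

class UsageTracker:
    def __init__(self, model_name: str = "gemini-2.5-flash"):
        self.model = GenerativeModel(model_name)
        self.prompt_tokens = 0
        self.output_tokens = 0

    def generate(self, prompt: str, **kwargs):
        response = self.model.generate_content(prompt, **kwargs)
        usage = response.usage_metadata
        self.prompt_tokens += usage.prompt_token_count
        self.output_tokens += usage.candidates_token_count
        return response

    def cost(self, input_price=0.15, output_price=0.60) -> float:
        """Dollar cost using the Flash prices from the pricing table above."""
        return (self.prompt_tokens * input_price + self.output_tokens * output_price) / 1_000_000

# tracker = UsageTracker()
# tracker.generate("Summarize the return policy")
# print(f"Running cost: ${tracker.cost():.4f}")
```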
Advanced Context Caching
Learning objectives
- Understand implicit vs explicit caching
- Optimize the TTL to maximize ROI
- Implement warming strategies
- Calculate the caching ROI for your use case
Implicit vs Explicit Caching
| Aspect | Implicit caching | Explicit caching |
|---|---|---|
| Activation | Automatic (Gemini 2.5+) | Manual via the API |
| Minimum size | >1,024-token prefix | >2,048 tokens |
| TTL | Fixed 5 minutes | 1-60 minutes, configurable |
| Cache cost | Free | $0.015/1M tokens (Flash) |
| Use case | Short conversations | Long system instructions |
Context Caching ROI Calculator
class CachingROICalculator:
def __init__(self, model="flash"):
if model == "flash":
self.input_price = 0.15 # $/1M tokens
self.cache_price = 0.015 # $/1M tokens (10x cheaper)
elif model == "pro":
self.input_price = 3.00
self.cache_price = 0.30
def calculate_roi(self,
cached_tokens: int,
num_requests: int,
ttl_minutes: int):
"""Calculer ROI du caching"""
# WITHOUT CACHE
cost_without_cache = (
cached_tokens * num_requests / 1_000_000 * self.input_price
)
# WITH CACHE
# Cache creation: once
cache_creation = cached_tokens / 1_000_000 * self.input_price
# Cache hits: num_requests times
cache_hits = cached_tokens * num_requests / 1_000_000 * self.cache_price
# Storage: for ttl_minutes
cache_storage = cached_tokens / 1_000_000 * self.cache_price * (ttl_minutes / 60)
cost_with_cache = cache_creation + cache_hits + cache_storage
# ROI
savings = cost_without_cache - cost_with_cache
savings_pct = (savings / cost_without_cache) * 100
breakeven_requests = cache_creation / (
cached_tokens / 1_000_000 * (self.input_price - self.cache_price)
)
return {
"cost_without_cache": cost_without_cache,
"cost_with_cache": cost_with_cache,
"savings": savings,
"savings_pct": savings_pct,
"breakeven_requests": int(breakeven_requests) + 1
}
# Exemple : Chatbot support avec system instruction 5000 tokens
calc = CachingROICalculator(model="flash")
# Scenario 1 : 10 requetes/heure, cache 1h
result1 = calc.calculate_roi(
cached_tokens=5000,
num_requests=10,
ttl_minutes=60
)
print("Scenario 1 : 10 req/h, TTL 1h")
print(f" Sans cache: ${result1['cost_without_cache']:.6f}")
print(f" Avec cache: ${result1['cost_with_cache']:.6f}")
print(f" Economie: ${result1['savings']:.6f} ({result1['savings_pct']:.1f}%)")
print(f" Breakeven: {result1['breakeven_requests']} requetes")
print()
# Scenario 2 : 100 requetes/heure, cache 1h
result2 = calc.calculate_roi(
cached_tokens=5000,
num_requests=100,
ttl_minutes=60
)
print("Scenario 2 : 100 req/h, TTL 1h")
print(f" Sans cache: ${result2['cost_without_cache']:.6f}")
print(f" Avec cache: ${result2['cost_with_cache']:.6f}")
print(f" Economie: ${result2['savings']:.6f} ({result2['savings_pct']:.1f}%)")
print(f" Breakeven: {result2['breakeven_requests']} requetes")
# SORTIE :
# Scenario 1 : 10 req/h, TTL 1h
# Sans cache: $0.007500
# Avec cache: $0.001575
# Economie: $0.005925 (79.0%)
# Breakeven: 2 requetes
# Scenario 2 : 100 req/h, TTL 1h
# Sans cache: $0.075000
# Avec cache: $0.008325
# Economie: $0.066675 (88.9%)
# Breakeven: 2 requetes
# โ ROI positif des 2 requetes !
โฑ๏ธ Optimisation TTL
Exemple : Si requetes toutes les 2 min โ TTL = 10 min
import datetime
from vertexai.preview import caching
# Analyser pattern de traffic pour definir TTL
def optimize_ttl(request_intervals_minutes: list) -> int:
"""Calculer TTL optimal base sur pattern traffic"""
avg_interval = sum(request_intervals_minutes) / len(request_intervals_minutes)
optimal_ttl = int(avg_interval * 5)
# Contraintes : 1-60 minutes
if optimal_ttl < 1:
return 1
elif optimal_ttl > 60:
return 60
else:
return optimal_ttl
# Exemple : Chatbot avec pic traffic 9h-18h
# Requetes toutes les 3 min en moyenne
intervals = [3, 2, 4, 3, 5, 2, 3, 4] # minutes
optimal_ttl = optimize_ttl(intervals)
print(f"TTL optimal: {optimal_ttl} minutes") # โ 15 minutes
# Creer cache avec TTL optimal
cached_content = caching.CachedContent.create(
model_name="gemini-2.5-flash",
system_instruction="[... 5000 tokens ...]",
ttl=datetime.timedelta(minutes=optimal_ttl),
)
# Alternative : TTL absolu (expire a heure precise)
# Utile pour cache qui doit expirer en fin de journee
expire_time = datetime.datetime.now() + datetime.timedelta(hours=8)
cached_content_absolute = caching.CachedContent.create(
model_name="gemini-2.5-flash",
system_instruction="[... 5000 tokens ...]",
expire_time=expire_time, # Expire a 18h
)
๐ฅ Warming Strategies
Probleme : Si cache expire pendant pic traffic, premiere requete lente (cold start).
Solution : Cache warming preemptif.
import time
import threading
from datetime import datetime, timedelta
from vertexai.preview import caching
from vertexai.generative_models import GenerativeModel
class CacheWarmer:
def __init__(self, system_instruction: str, ttl_minutes: int):
self.system_instruction = system_instruction
self.ttl_minutes = ttl_minutes
self.cached_content = None
self.model = None
self.warming_thread = None
def create_cache(self):
"""Creer ou renouveler cache"""
self.cached_content = caching.CachedContent.create(
model_name="gemini-2.5-flash",
system_instruction=self.system_instruction,
ttl=timedelta(minutes=self.ttl_minutes),
)
self.model = GenerativeModel.from_cached_content(self.cached_content)
print(f"[{datetime.now()}] Cache created/renewed")
def start_warming(self):
"""Demarrer warming automatique"""
self.create_cache()
# Renouveler cache avant expiration
refresh_interval = (self.ttl_minutes - 1) * 60 # 1 min avant expiration
def warming_loop():
while True:
time.sleep(refresh_interval)
self.create_cache()
self.warming_thread = threading.Thread(target=warming_loop, daemon=True)
self.warming_thread.start()
def generate(self, prompt: str):
"""Generate avec cache toujours chaud"""
if self.model is None:
raise RuntimeError("Cache not initialized. Call start_warming() first.")
return self.model.generate_content(prompt)
# Utilisation : Cache toujours chaud pendant heures bureau
warmer = CacheWarmer(
system_instruction="[... 5000 tokens system instruction ...]",
ttl_minutes=30
)
# Demarrer warming (renouvelle cache toutes les 29 min)
warmer.start_warming()
# Toutes les requetes utilisent cache chaud (pas de cold start)
response1 = warmer.generate("Question 1")
time.sleep(1800) # 30 min plus tard
response2 = warmer.generate("Question 2") # Cache renouvele automatiquement !
# Economie : Pas de cold start, latence optimale
๐ Comparaison Couts : Cache vs No Cache
| Scenario | Cached Tokens | Requests/Day | Cost No Cache | Cost With Cache | Savings |
|---|---|---|---|---|---|
| Chatbot support | 5,000 | 10,000 | $7.50 | $0.83 | 89% |
| RAG system | 20,000 | 5,000 | $15.00 | $1.65 | 89% |
| Agent avec tools | 10,000 | 1,000 | $1.50 | $0.17 | 89% |
| Code assistant | 30,000 | 20,000 | $90.00 | $9.90 | 89% |
Evitez l'explicit caching dans les cas suivants (voir le sketch de decision apres cette liste) :
- System instruction <2000 tokens (ROI negatif)
- Moins de 5 requetes pendant TTL (breakeven non atteint)
- Context change frequemment (invalidation cache trop souvent)
- Implicit cache suffit (prefix >1024 tokens, requetes <5 min)
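Ces criteres peuvent se resumer en un petit helper de decision, purement illustratif (seuils repris de la liste ci-dessus) :
# Sketch : decider si l'explicit caching vaut le coup pour un use case donne
def should_use_explicit_cache(context_tokens: int,
                              requests_during_ttl: int,
                              context_changes_often: bool) -> bool:
    if context_tokens < 2000:        # ROI negatif sous ~2000 tokens
        return False
    if requests_during_ttl < 5:      # breakeven non atteint pendant le TTL
        return False
    if context_changes_often:        # invalidation trop frequente du cache
        return False
    return True

print(should_use_explicit_cache(5000, 50, False))   # True  : chatbot support
print(should_use_explicit_cache(1500, 100, False))  # False : context trop court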
๐ ๏ธ Cache Management Best Practices
from vertexai.preview import caching
from datetime import timedelta
# 1. Lister tous les caches actifs
caches = caching.CachedContent.list()
for cache in caches:
print(f"Cache: {cache.name}")
print(f" Model: {cache.model}")
print(f" Expire: {cache.expire_time}")
print(f" Size: {len(cache.system_instruction)} chars")
# 2. Supprimer cache manuellement si context change
cache_to_delete = caching.CachedContent(cached_content_name="cache-123")
cache_to_delete.delete()
print("Cache deleted")
# 3. Mettre a jour TTL d'un cache existant
cache_to_update = caching.CachedContent(cached_content_name="cache-456")
cache_to_update.update(ttl=timedelta(minutes=120)) # Extend TTL
print("TTL updated")
# 4. Monitoring usage cache
from google.cloud import monitoring_v3
client = monitoring_v3.MetricServiceClient()
query = """
fetch aiplatform.googleapis.com/prediction/cache_hit_count
| group_by 1h, [value_cache_hit_count_mean: mean(value.cache_hit_count)]
| every 1h
"""
# 5. Alert si cache hit rate < 80% (probleme TTL ou invalidation)
# โ Creer alerte Cloud Monitoring sur cache_hit_rate metric
Context caching est votre meilleur allie FinOps. Pour chatbot/RAG avec system instruction longue, ROI est positif des 2 requetes. Commencez avec TTL conservateur (30 min), puis ajustez base sur metriques. Implicit cache gratuit pour conversations courtes. Warming pour apps critiques latence.
Model Routing Intelligent
๐ฏ Objectifs d'apprentissage
- Implementer classifier de requetes multi-niveau
- Router Pro/Flash/Flash-8B intelligemment
- Gerer fallback et error handling
- Mesurer quality/cost tradeoff
๐ฏ Architecture Model Router
๐ง Classifier Implementation
from vertexai.generative_models import GenerativeModel, GenerationConfig
import json
from enum import Enum
class ComplexityLevel(Enum):
SIMPLE = "simple"
STANDARD = "standard"
COMPLEX = "complex"
class IntelligentRouter:
def __init__(self):
# Classifier ultra-rapide avec Flash-8B
self.classifier = GenerativeModel("gemini-2.5-flash-8b")
# 3 modeles production
self.flash_8b = GenerativeModel("gemini-2.5-flash-8b")
self.flash = GenerativeModel("gemini-2.5-flash")
self.pro = GenerativeModel("gemini-2.5-pro")
def classify_complexity(self, prompt: str) -> ComplexityLevel:
"""Classifier requete avec LLM (Flash-8B)"""
classification_prompt = f"""
Analyse cette requete utilisateur et determine sa complexite :
- SIMPLE : FAQ, recherche info factuelle, classification basique
- STANDARD : Resume, extraction donnees, generation texte standard
- COMPLEX : Analyse multi-etapes, raisonnement logique, creative writing
Requete : "{prompt}"
Reponds UNIQUEMENT par JSON :
{{"complexity": "simple|standard|complex", "reasoning": "explication courte"}}
"""
response = self.classifier.generate_content(
classification_prompt,
generation_config=GenerationConfig(
response_mime_type="application/json",
max_output_tokens=100,
temperature=0.1,
)
)
result = json.loads(response.text)
complexity = ComplexityLevel(result["complexity"])
print(f"[Classifier] {complexity.value.upper()}: {result['reasoning']}")
return complexity
def route_and_generate(self, prompt: str, temperature: float = 0.7):
"""Router et generer reponse"""
# 1. Classifier (coute ~$0.000004)
complexity = self.classify_complexity(prompt)
# 2. Selectionner modele
if complexity == ComplexityLevel.SIMPLE:
model = self.flash_8b
model_name = "Flash-8B"
elif complexity == ComplexityLevel.STANDARD:
model = self.flash
model_name = "Flash"
else: # COMPLEX
model = self.pro
model_name = "Pro"
print(f"[Router] โ {model_name}")
# 3. Generer avec fallback
try:
response = model.generate_content(
prompt,
generation_config=GenerationConfig(
temperature=temperature,
max_output_tokens=2048,
)
)
return response.text, model_name
except Exception as e:
# Fallback vers modele superieur si echec
print(f"[Router] Error with {model_name}, falling back to Pro")
response = self.pro.generate_content(prompt)
return response.text, "Pro (fallback)"
# Test router
router = IntelligentRouter()
# Requete simple โ Flash-8B
response1, model1 = router.route_and_generate(
"Quelle est la capitale de l'Italie ?"
)
print(f"Model: {model1}\nReponse: {response1}\n")
# Requete standard โ Flash
response2, model2 = router.route_and_generate(
"Resume les 3 principales caracteristiques du cloud computing"
)
print(f"Model: {model2}\nReponse: {response2}\n")
# Requete complexe โ Pro
response3, model3 = router.route_and_generate(
"Analyse les implications ethiques de l'IA dans le systeme judiciaire, "
"en considerant les biais algorithmiques et la transparence des decisions"
)
print(f"Model: {model3}\nReponse: {response3}\n")
๐ Quality/Cost Tradeoff Analysis
import time
from dataclasses import dataclass
@dataclass
class RoutingMetrics:
model: str
latency_ms: float
cost_usd: float
quality_score: float # 0-100, evaluation humaine ou automatique
class RouterAnalyzer:
def __init__(self):
self.metrics = []
def evaluate_routing(self,
test_queries: list,
router: IntelligentRouter):
"""Evaluer quality/cost tradeoff"""
total_cost = 0
total_latency = 0
total_quality = 0
for query in test_queries:
start = time.time()
response, model = router.route_and_generate(query)
latency = (time.time() - start) * 1000
# Estimer cout base sur tokens (simplifie)
tokens_estimate = len(query.split()) * 1.3 + len(response.split()) * 1.3
if "Flash-8B" in model:
cost = tokens_estimate / 1_000_000 * 0.20 # Input + output
elif "Flash" in model:
cost = tokens_estimate / 1_000_000 * 0.75
else: # Pro
cost = tokens_estimate / 1_000_000 * 15.00
# Quality score (simuler evaluation - en prod, utiliser LLM judge)
quality = self._evaluate_quality(query, response)
self.metrics.append(RoutingMetrics(
model=model,
latency_ms=latency,
cost_usd=cost,
quality_score=quality
))
total_cost += cost
total_latency += latency
total_quality += quality
# Calculer moyennes
n = len(test_queries)
avg_cost = total_cost / n
avg_latency = total_latency / n
avg_quality = total_quality / n
print("=== ROUTING ANALYSIS ===")
print(f"Total queries: {n}")
print(f"Avg cost/query: ${avg_cost:.6f}")
print(f"Avg latency: {avg_latency:.0f}ms")
print(f"Avg quality: {avg_quality:.1f}/100")
print(f"\nTotal cost: ${total_cost:.4f}")
# Distribution modeles
model_counts = {}
for m in self.metrics:
model_counts[m.model] = model_counts.get(m.model, 0) + 1
print("\n=== MODEL DISTRIBUTION ===")
for model, count in sorted(model_counts.items()):
pct = count / n * 100
print(f"{model}: {count} ({pct:.1f}%)")
return {
"avg_cost": avg_cost,
"avg_latency": avg_latency,
"avg_quality": avg_quality,
"model_distribution": model_counts
}
def _evaluate_quality(self, query: str, response: str) -> float:
"""Evaluer qualite reponse (simplifie)"""
# En production : utiliser LLM judge ou human evaluation
# Ici : heuristique simple
if len(response) < 50:
return 60.0
elif "sorry" in response.lower() or "cannot" in response.lower():
return 40.0
else:
return 85.0
# Test avec dataset
test_queries = [
"Capitale du Japon ?",
"Liste 3 langages de programmation",
"Explique la photosynthese simplement",
"Compare architecture REST vs GraphQL en detail",
"Analyse critique de la blockchain pour supply chain avec exemples concrets",
]
analyzer = RouterAnalyzer()
results = analyzer.evaluate_routing(test_queries, router)
# SORTIE EXEMPLE :
# === ROUTING ANALYSIS ===
# Total queries: 5
# Avg cost/query: $0.000180
# Avg latency: 1250ms
# Avg quality: 82.0/100
#
# Total cost: $0.0009
#
# === MODEL DISTRIBUTION ===
# Flash: 2 (40.0%)
# Flash-8B: 2 (40.0%)
# Pro: 1 (20.0%)
๐ก๏ธ Fallback Strategy
from typing import Optional
class RobustRouter:
def __init__(self):
self.flash_8b = GenerativeModel("gemini-2.5-flash-8b")
self.flash = GenerativeModel("gemini-2.5-flash")
self.pro = GenerativeModel("gemini-2.5-pro")
def generate_with_fallback(self,
prompt: str,
preferred_model: str = "flash") -> dict:
"""Generate avec fallback cascade"""
# Definir cascade
if preferred_model == "flash-8b":
cascade = [self.flash_8b, self.flash, self.pro]
cascade_names = ["Flash-8B", "Flash", "Pro"]
elif preferred_model == "flash":
cascade = [self.flash, self.pro]
cascade_names = ["Flash", "Pro"]
else: # pro
cascade = [self.pro]
cascade_names = ["Pro"]
# Essayer cascade
last_error = None
for model, name in zip(cascade, cascade_names):
try:
print(f"[Fallback] Trying {name}...")
response = model.generate_content(
prompt,
generation_config=GenerationConfig(
max_output_tokens=2048,
temperature=0.7,
)
)
# Verifier qualite reponse
if response.text and len(response.text) > 10:
print(f"[Fallback] โ Success with {name}")
return {
"text": response.text,
"model": name,
"fallback": name != cascade_names[0]
}
else:
raise ValueError("Response too short")
except Exception as e:
print(f"[Fallback] โ {name} failed: {e}")
last_error = e
continue
# Tous les modeles ont echoue
raise RuntimeError(f"All models failed. Last error: {last_error}")
# Test fallback
robust_router = RobustRouter()
# Requete normale : Flash suffit
result1 = robust_router.generate_with_fallback(
"Explique REST API",
preferred_model="flash"
)
print(f"Model: {result1['model']}, Fallback: {result1['fallback']}\n")
# Requete tres longue : Flash echoue โ Pro fallback
# (simuler echec en ajoutant prompt trop long pour Flash)
long_prompt = "Analyse " + " ".join(["cette situation complexe"] * 10000)
try:
result2 = robust_router.generate_with_fallback(
long_prompt,
preferred_model="flash"
)
print(f"Model: {result2['model']}, Fallback: {result2['fallback']}\n")
except Exception as e:
print(f"Error: {e}")
๐ก Regles de Routing Optimales
| Use Case | Modele | Raison |
|---|---|---|
| FAQ / Support Tier 1 | Flash-8B | Reponses factuelles, latence critique, volume eleve |
| Classification / Tagging | Flash-8B | Sortie JSON, deterministe, rapide |
| Extraction entites | Flash | Precision > vitesse, output structure |
| Resume documents | Flash | Equilibre qualite/cout, context long |
| Code generation | Flash | Syntaxe correcte, output deterministe |
| Analyse complexe | Pro | Raisonnement multi-etapes, nuance |
| Creative writing | Pro | Creativite, style, coherence longue |
| Research synthesis | Pro | Comprehension profonde, cross-referencing |
Model routing intelligent peut reduire les couts de 60-70% sans degrader la qualite. Utilisez Flash-8B pour 70% des requetes (FAQ, classification), Flash pour 25% (summaries, extraction), Pro pour 5% seulement (analyse complexe). Le classifier coute ~$0.000004 par requete : ROI positif immediat. Fallback vers Pro = safety net si Flash echoue.
Batch API & Traitement Asynchrone
๐ฏ Objectifs d'apprentissage
- Comprendre Batch API et -50% reduction cout
- Implementer workflows JSONL batch
- Utiliser SDK OpenAI compatible
- Monitorer et gerer batch jobs
๐ฆ Batch API : -50% Cout pour Traitement Non-Urgent
- Traitement asynchrone acceptable (10-30 minutes)
- Volume eleve (>1000 requetes)
- Use cases : ETL, data enrichment, bulk classification, offline evaluation
- Economie : 50% vs real-time API
๐ Workflow Batch API
๐ Implementation Complete
import json
from google.cloud import storage, aiplatform
from datetime import datetime
import time
class GeminiBatchProcessor:
def __init__(self,
project_id: str,
location: str,
bucket_name: str):
self.project_id = project_id
self.location = location
self.bucket_name = bucket_name
aiplatform.init(project=project_id, location=location)
self.storage_client = storage.Client()
def prepare_batch_jsonl(self,
prompts: list[str],
output_file: str = "batch_input.jsonl"):
"""Preparer fichier JSONL pour batch"""
with open(output_file, "w") as f:
for i, prompt in enumerate(prompts):
request = {
"request": {
"contents": [
{
"role": "user",
"parts": [{"text": prompt}]
}
]
}
}
f.write(json.dumps(request) + "\n")
print(f"โ Created {output_file} with {len(prompts)} requests")
return output_file
def upload_to_gcs(self, local_file: str, gcs_path: str):
"""Upload fichier vers GCS"""
bucket = self.storage_client.bucket(self.bucket_name)
blob = bucket.blob(gcs_path)
blob.upload_from_filename(local_file)
gcs_uri = f"gs://{self.bucket_name}/{gcs_path}"
print(f"โ Uploaded to {gcs_uri}")
return gcs_uri
def submit_batch_job(self,
input_uri: str,
output_uri_prefix: str,
model_name: str = "gemini-2.5-flash"):
"""Submit batch prediction job"""
batch_job = aiplatform.BatchPredictionJob.create(
job_display_name=f"gemini-batch-{datetime.now().strftime('%Y%m%d-%H%M%S')}",
model_name=model_name,
gcs_source=input_uri,
gcs_destination_prefix=output_uri_prefix,
instances_format="jsonl",
predictions_format="jsonl",
)
print(f"โ Batch job submitted: {batch_job.name}")
print(f" Status: {batch_job.state}")
return batch_job
def monitor_job(self, batch_job, poll_interval: int = 60):
    """Monitorer job jusqu'a completion (lire .state resynchronise la ressource)"""
    print(f"Monitoring job {batch_job.display_name}...")
    terminal_states = {"JOB_STATE_SUCCEEDED", "JOB_STATE_FAILED", "JOB_STATE_CANCELLED"}
    while batch_job.state.name not in terminal_states:
        time.sleep(poll_interval)
        print(f"  Status: {batch_job.state.name} ({datetime.now().strftime('%H:%M:%S')})")
    if batch_job.state.name == "JOB_STATE_SUCCEEDED":
        print("โ Job completed successfully!")
        return True
    else:
        print(f"โ Job failed: {batch_job.error}")
        return False
def download_results(self, output_uri_prefix: str, local_file: str = "batch_output.jsonl"):
"""Download resultats depuis GCS"""
# Parse GCS URI
parts = output_uri_prefix.replace("gs://", "").split("/")
bucket_name = parts[0]
prefix = "/".join(parts[1:])
# Lister fichiers output
bucket = self.storage_client.bucket(bucket_name)
blobs = list(bucket.list_blobs(prefix=prefix))
# Download tous les fichiers (batch peut splitter en plusieurs)
results = []
for blob in blobs:
if blob.name.endswith(".jsonl"):
content = blob.download_as_text()
for line in content.strip().split("\n"):
results.append(json.loads(line))
# Sauver local
with open(local_file, "w") as f:
for result in results:
f.write(json.dumps(result) + "\n")
print(f"โ Downloaded {len(results)} results to {local_file}")
return results
# EXAMPLE : Batch classification de 10,000 feedbacks clients
processor = GeminiBatchProcessor(
project_id="my-project",
location="us-central1",
bucket_name="my-batch-bucket"
)
# 1. Preparer prompts
feedbacks = [
"Le produit est excellent, livraison rapide !",
"Service client nul, attente 2h au telephone",
# ... 9,998 autres feedbacks
]
classification_prompts = [
f"Classifie ce feedback client en POSITIF, NEGATIF ou NEUTRE. "
f"Feedback: \"{fb}\"\nReponse (un seul mot):"
for fb in feedbacks
]
# 2. Creer JSONL
input_file = processor.prepare_batch_jsonl(classification_prompts)
# 3. Upload GCS
input_uri = processor.upload_to_gcs(
input_file,
"batch_jobs/classification_input.jsonl"
)
# 4. Submit job
output_uri = f"gs://{processor.bucket_name}/batch_jobs/output/"
batch_job = processor.submit_batch_job(
input_uri=input_uri,
output_uri_prefix=output_uri,
model_name="gemini-2.5-flash" # -50% vs real-time
)
# 5. Monitor (bloquant)
success = processor.monitor_job(batch_job, poll_interval=60)
# 6. Download resultats
if success:
results = processor.download_results(output_uri)
# Parser resultats
classifications = []
for result in results:
text = result["response"]["candidates"][0]["content"]["parts"][0]["text"]
classifications.append(text.strip().upper())
# Stats
from collections import Counter
counts = Counter(classifications)
print("\n=== RESULTS ===")
print(f"Positif: {counts['POSITIF']}")
print(f"Negatif: {counts['NEGATIF']}")
print(f"Neutre: {counts['NEUTRE']}")
# ECONOMIE :
# Real-time : 10,000 req x 100 tokens input x $0.15/1M = $0.15
#           + 10,000 req x 100 tokens output x $0.60/1M = $0.60
#           = $0.75
# Batch API : $0.75 x 0.5 = $0.375
# Economie : $0.375 (50%)
๐ง SDK OpenAI Compatible
Vertex AI Batch API est compatible avec SDK OpenAI pour faciliter migration.
# Installation
# pip install google-cloud-aiplatform openai
import json
import os
import time
from openai import OpenAI

PROJECT_ID = "my-project"
LOCATION = "us-central1"

# Configurer client OpenAI avec endpoint Vertex AI
client = OpenAI(
    base_url=f"https://{LOCATION}-aiplatform.googleapis.com/v1/projects/{PROJECT_ID}/locations/{LOCATION}/endpoints/openai",
    api_key=os.environ.get("GOOGLE_API_KEY")  # Utiliser ADC en prod
)

# Creer batch file (format OpenAI)
prompts = classification_prompts  # ou toute liste de prompts a traiter
with open("batch_openai.jsonl", "w") as f:
    for i, prompt in enumerate(prompts):
        request = {
            "custom_id": f"request-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gemini-2.5-flash",
                "messages": [{"role": "user", "content": prompt}],
                "max_tokens": 200
            }
        }
        f.write(json.dumps(request) + "\n")
# Upload batch file
batch_input_file = client.files.create(
file=open("batch_openai.jsonl", "rb"),
purpose="batch"
)
# Create batch job
batch = client.batches.create(
input_file_id=batch_input_file.id,
endpoint="/v1/chat/completions",
completion_window="24h"
)
print(f"Batch ID: {batch.id}")
print(f"Status: {batch.status}")
# Poll status
while batch.status not in ["completed", "failed", "cancelled"]:
time.sleep(60)
batch = client.batches.retrieve(batch.id)
print(f"Status: {batch.status}")
# Download results
if batch.status == "completed":
result_file = client.files.content(batch.output_file_id)
with open("batch_results.jsonl", "wb") as f:
f.write(result_file.read())
๐ Monitoring Batch Jobs
from google.cloud import aiplatform
# Lister tous les batch jobs
batch_jobs = aiplatform.BatchPredictionJob.list(
filter='display_name:gemini-batch-*',
order_by='create_time desc'
)
print("=== BATCH JOBS ===")
for job in batch_jobs:
print(f"Name: {job.display_name}")
print(f" State: {job.state}")
print(f" Created: {job.create_time}")
print(f" Model: {job.model_name}")
if job.state.name == "JOB_STATE_SUCCEEDED":
    # Calculer metriques
    elapsed = (job.end_time - job.create_time).total_seconds()
    print(f"  Duration: {elapsed/60:.1f} minutes")
# Creer dashboard Cloud Monitoring
from google.cloud import monitoring_v3
query = """
fetch aiplatform.googleapis.com/prediction/batch_prediction_job/count
| filter resource.job_id =~ 'gemini-batch-.*'
| group_by [resource.state], 1d
| every 1d
"""
# Alertes sur batch job failures
# gcloud alpha monitoring policies create \
# --notification-channels=CHANNEL_ID \
# --display-name="Batch Job Failures" \
# --condition-threshold-value=1 \
# --condition-threshold-duration=300s
Limitations du Batch API :
- Latence 10-30 minutes (non-realtime)
- Pas de streaming
- Pas de function calling (en beta)
- Limite 50,000 requetes par job (voir le decoupage en plusieurs jobs ci-dessous)
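Pour depasser la limite de 50 000 requetes, il suffit de decouper le volume en plusieurs jobs ; sketch illustratif reutilisant le GeminiBatchProcessor ci-dessus :
# Sketch : decouper un gros volume en plusieurs batch jobs de 50 000 requetes max
MAX_REQUESTS_PER_JOB = 50_000

def submit_in_chunks(processor, prompts, output_uri_prefix):
    jobs = []
    for start in range(0, len(prompts), MAX_REQUESTS_PER_JOB):
        chunk = prompts[start:start + MAX_REQUESTS_PER_JOB]
        local_file = processor.prepare_batch_jsonl(chunk, f"batch_input_{start}.jsonl")
        input_uri = processor.upload_to_gcs(local_file, f"batch_jobs/input_{start}.jsonl")
        jobs.append(processor.submit_batch_job(input_uri, output_uri_prefix))
    return jobs

# Exemple : 120 000 prompts -> 3 jobs soumis
# jobs = submit_in_chunks(processor, classification_prompts, output_uri)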
Batch API offre une reduction de cout de 50% pour les workloads non-urgents. Utilisez-le pour ETL overnight, bulk classification, offline evaluation. Temps de processing : 10-30 min. Si vous avez 100K+ requetes/jour et une latence non-critique, l'economie annuelle peut atteindre $10-50K. Setup initial 1-2h, ROI immediat.
Monitoring des Couts
๐ฏ Objectifs d'apprentissage
- Configurer Cloud Billing pour tracking IA
- Creer budget alerts et seuils
- Builder cost dashboards temps reel
- Implementer attribution par projet/equipe
๐ฐ Architecture Cost Monitoring
๐ Export Billing vers BigQuery
# 1. Creer dataset BigQuery pour billing
bq mk --dataset --location=US my_project:billing_export

# 2. Activer Cloud Billing export (via Console ou gcloud)
# Console : Billing โ Billing export โ BigQuery export โ Enable

# 3. Verifier export actif
bq ls my_project:billing_export
# โ gcp_billing_export_v1_XXXXXX_XXXXXX_XXXXXX

# 4. Query couts Vertex AI
bq query --use_legacy_sql=false '
SELECT
  service.description AS service,
  sku.description AS sku,
  SUM(cost) AS total_cost,
  SUM(usage.amount) AS usage_amount,
  usage.unit AS unit
FROM `my_project.billing_export.gcp_billing_export_v1_*`
WHERE service.description = "Vertex AI"
  AND _TABLE_SUFFIX BETWEEN "20260201" AND "20260210"
GROUP BY service, sku, unit
ORDER BY total_cost DESC
LIMIT 20
'

# SORTIE EXEMPLE :
# service   | sku                              | total_cost | usage_amount | unit
# Vertex AI | Gemini 2.5 Flash Input Tokens    | 125.50     | 836666666    | tokens
# Vertex AI | Gemini 2.5 Flash Output Tokens   | 85.30      | 142166666    | tokens
# Vertex AI | Gemini 2.5 Pro Input Tokens      | 45.20      | 15066666     | tokens
# Vertex AI | Context Caching Storage          | 2.10       | 140000000    | tokens
๐จ Budget Alerts
from google.cloud import billing_budgets_v1
def create_ai_budget_alert(
billing_account_id: str,
project_id: str,
budget_amount: float,
alert_thresholds: list = [0.5, 0.9, 1.0]
):
"""Creer budget alert pour Vertex AI"""
client = billing_budgets_v1.BudgetServiceClient()
# Configurer budget
budget = billing_budgets_v1.Budget()
budget.display_name = f"Vertex AI Budget - {project_id}"
budget.budget_filter = billing_budgets_v1.Filter(
projects=[f"projects/{project_id}"],
services=["services/aiplatform.googleapis.com"], # Vertex AI
)
# Montant mensuel
budget.amount = billing_budgets_v1.BudgetAmount(
specified_amount={"currency_code": "USD", "units": int(budget_amount)}
)
# Seuils d'alerte
budget.threshold_rules = [
billing_budgets_v1.ThresholdRule(
threshold_percent=threshold,
spend_basis=billing_budgets_v1.ThresholdRule.Basis.CURRENT_SPEND,
)
for threshold in alert_thresholds
]
# Creer budget
parent = f"billingAccounts/{billing_account_id}"
response = client.create_budget(parent=parent, budget=budget)
print(f"โ Budget created: {response.name}")
print(f" Amount: ${budget_amount}/month")
print(f" Alerts at: {', '.join([f'{int(t*100)}%' for t in alert_thresholds])}")
return response
# Creer budget $1000/mois avec alertes a 50%, 90%, 100%
budget = create_ai_budget_alert(
billing_account_id="012345-6789AB-CDEF01",
project_id="my-ai-project",
budget_amount=1000.0,
alert_thresholds=[0.5, 0.9, 1.0]
)
# Configurer notification email/Pub/Sub
# Via Console : Billing โ Budgets & alerts โ Select budget โ Manage notifications
๐ Cost Dashboard en Temps Reel
from google.cloud import bigquery
import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime, timedelta
class VertexAICostDashboard:
def __init__(self, project_id: str, billing_dataset: str):
self.client = bigquery.Client(project=project_id)
self.billing_table = f"`{project_id}.{billing_dataset}.gcp_billing_export_v1_*`"
def get_daily_costs(self, days: int = 30) -> pd.DataFrame:
"""Couts quotidiens Vertex AI"""
start_date = (datetime.now() - timedelta(days=days)).strftime("%Y%m%d")
end_date = datetime.now().strftime("%Y%m%d")
query = f"""
SELECT
DATE(usage_start_time) AS date,
SUM(cost) AS daily_cost
FROM {self.billing_table}
WHERE service.description = 'Vertex AI'
AND _TABLE_SUFFIX BETWEEN '{start_date}' AND '{end_date}'
GROUP BY date
ORDER BY date
"""
return self.client.query(query).to_dataframe()
def get_costs_by_model(self, days: int = 7) -> pd.DataFrame:
"""Couts par modele Gemini"""
start_date = (datetime.now() - timedelta(days=days)).strftime("%Y%m%d")
end_date = datetime.now().strftime("%Y%m%d")
query = f"""
SELECT
CASE
WHEN sku.description LIKE '%2.5 Pro%' THEN 'Gemini 2.5 Pro'
WHEN sku.description LIKE '%2.5 Flash-8B%' THEN 'Gemini 2.5 Flash-8B'
WHEN sku.description LIKE '%2.5 Flash%' THEN 'Gemini 2.5 Flash'
ELSE 'Other'
END AS model,
SUM(cost) AS cost,
SUM(usage.amount) AS tokens
FROM {self.billing_table}
WHERE service.description = 'Vertex AI'
AND _TABLE_SUFFIX BETWEEN '{start_date}' AND '{end_date}'
GROUP BY model
ORDER BY cost DESC
"""
return self.client.query(query).to_dataframe()
def get_costs_by_label(self, label_key: str, days: int = 7) -> pd.DataFrame:
"""Couts par label (team, project, env)"""
start_date = (datetime.now() - timedelta(days=days)).strftime("%Y%m%d")
end_date = datetime.now().strftime("%Y%m%d")
query = f"""
SELECT
labels.value AS {label_key},
SUM(cost) AS cost
FROM {self.billing_table},
UNNEST(labels) AS labels
WHERE service.description = 'Vertex AI'
AND _TABLE_SUFFIX BETWEEN '{start_date}' AND '{end_date}'
AND labels.key = '{label_key}'
GROUP BY {label_key}
ORDER BY cost DESC
"""
return self.client.query(query).to_dataframe()
def plot_dashboard(self):
"""Generer dashboard visuel"""
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
# 1. Daily costs trend
daily = self.get_daily_costs(days=30)
axes[0, 0].plot(daily['date'], daily['daily_cost'], marker='o')
axes[0, 0].set_title('Daily Costs (30 days)')
axes[0, 0].set_xlabel('Date')
axes[0, 0].set_ylabel('Cost ($)')
axes[0, 0].grid(True)
# 2. Costs by model (pie chart)
models = self.get_costs_by_model(days=7)
axes[0, 1].pie(models['cost'], labels=models['model'], autopct='%1.1f%%')
axes[0, 1].set_title('Costs by Model (7 days)')
# 3. Costs by team
teams = self.get_costs_by_label('team', days=7)
axes[1, 0].bar(teams['team'], teams['cost'])
axes[1, 0].set_title('Costs by Team (7 days)')
axes[1, 0].set_xlabel('Team')
axes[1, 0].set_ylabel('Cost ($)')
axes[1, 0].tick_params(axis='x', rotation=45)
# 4. Summary stats
total_cost = daily['daily_cost'].sum()
avg_daily = daily['daily_cost'].mean()
forecast_monthly = avg_daily * 30
summary_text = f"""
=== COST SUMMARY ===
Last 30 days: ${total_cost:.2f}
Avg daily: ${avg_daily:.2f}
Forecast monthly: ${forecast_monthly:.2f}
Top model: {models.iloc[0]['model']}
Top model cost: ${models.iloc[0]['cost']:.2f}
"""
axes[1, 1].text(0.1, 0.5, summary_text, fontsize=12, family='monospace')
axes[1, 1].axis('off')
plt.tight_layout()
plt.savefig('vertex_ai_cost_dashboard.png', dpi=150)
print("โ Dashboard saved to vertex_ai_cost_dashboard.png")
# Generer dashboard
dashboard = VertexAICostDashboard(
project_id="my-project",
billing_dataset="billing_export"
)
dashboard.plot_dashboard()
# Pour dashboard temps reel : deployer sur Cloud Run + scheduler toutes les heures
๐ท๏ธ Cost Attribution avec Labels
from vertexai.generative_models import GenerativeModel
# Labeler requetes par team/project/environment
def generate_with_labels(prompt: str, labels: dict):
"""Generate avec labels pour cost tracking"""
# Labels format: key=value
# Exemples : team=data-science, project=chatbot, env=prod
model = GenerativeModel(
"gemini-2.5-flash",
# Labels attaches a chaque requete
labels=labels
)
response = model.generate_content(prompt)
return response.text
# Utilisation : tracer couts par equipe
response1 = generate_with_labels(
"Resume ce document",
labels={
"team": "marketing",
"project": "content-generation",
"env": "prod"
}
)
response2 = generate_with_labels(
"Analyse ces donnees",
labels={
"team": "data-science",
"project": "analytics",
"env": "dev"
}
)
# Query couts par team
# SELECT labels.value AS team, SUM(cost) AS cost
# FROM billing_table, UNNEST(labels) AS labels
# WHERE labels.key = 'team'
# GROUP BY team
# โ Marketing: $450, Data Science: $780
# Chargeback : facturer equipes internes base sur usage reel
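A titre d'illustration, le chargeback peut ensuite se calculer au prorata des couts par equipe (montants hypothetiques repris du commentaire ci-dessus) :
# Sketch : repartition de la facture Vertex AI entre equipes (chargeback)
team_costs = {"marketing": 450.0, "data-science": 780.0}  # resultats de la requete par label 'team'
total = sum(team_costs.values())

for team, cost in team_costs.items():
    share = cost / total * 100
    print(f"{team}: ${cost:.2f} ({share:.1f}% de la facture)")
# marketing: $450.00 (36.6% de la facture)
# data-science: $780.00 (63.4% de la facture)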
๐ฏ Cost Optimization Recommendations
def analyze_cost_optimization_opportunities(billing_df: pd.DataFrame) -> dict:
"""Analyser opportunites d'optimisation"""
recommendations = []
# 1. Detecter usage Pro pour requetes simples
pro_usage = billing_df[billing_df['sku'].str.contains('2.5 Pro')]
if not pro_usage.empty:
pro_cost = pro_usage['cost'].sum()
potential_savings = pro_cost * 0.95 # 95% si migration vers Flash
recommendations.append({
"type": "Model Downgrade",
"current_cost": pro_cost,
"potential_savings": potential_savings,
"action": "Implementer model routing : Flash pour 80% requetes"
})
# 2. Detecter absence de caching
cache_usage = billing_df[billing_df['sku'].str.contains('Cache')]
if cache_usage.empty:
input_cost = billing_df[billing_df['sku'].str.contains('Input')]['cost'].sum()
potential_savings = input_cost * 0.5 # 50% avec caching
recommendations.append({
"type": "Context Caching",
"current_cost": input_cost,
"potential_savings": potential_savings,
"action": "Activer context caching pour system instructions"
})
# 3. Detecter ratio input/output eleve (prompts longs)
input_cost = billing_df[billing_df['sku'].str.contains('Input')]['cost'].sum()
output_cost = billing_df[billing_df['sku'].str.contains('Output')]['cost'].sum()
ratio = input_cost / output_cost if output_cost > 0 else 0
if ratio > 3:
potential_savings = input_cost * 0.3 # 30% avec prompt compression
recommendations.append({
"type": "Prompt Compression",
"current_cost": input_cost,
"potential_savings": potential_savings,
"action": "Optimiser prompts : supprimer redondances, aller droit au but"
})
# 4. Calculer ROI total
total_current = billing_df['cost'].sum()
total_savings = sum([r['potential_savings'] for r in recommendations])
savings_pct = (total_savings / total_current * 100) if total_current > 0 else 0
return {
"current_monthly_cost": total_current,
"potential_monthly_savings": total_savings,
"savings_percentage": savings_pct,
"recommendations": recommendations
}
# Exemple (billing_df : DataFrame issu de l'export billing, avec au minimum les colonnes 'sku' et 'cost')
recommendations = analyze_cost_optimization_opportunities(billing_df)
print(f"Current monthly cost: ${recommendations['current_monthly_cost']:.2f}")
print(f"Potential savings: ${recommendations['potential_monthly_savings']:.2f} ({recommendations['savings_percentage']:.1f}%)")
print("\nRecommendations:")
for i, rec in enumerate(recommendations['recommendations'], 1):
print(f"{i}. {rec['type']}: Save ${rec['potential_savings']:.2f}/month")
print(f" Action: {rec['action']}")
Cost monitoring proactif = cle FinOps. Exportez billing vers BigQuery (gratuit), creez dashboards Looker Studio, configurez alertes a 50%/90%/100% budget. Utilisez labels pour attribution par team (chargeback). Revisez dashboard chaque semaine, identifiez anomalies, optimisez. Avec monitoring, vous detectez derive avant facture surprenante.
Lab : Dashboard FinOps Complet
๐ฏ Objectif du Lab
Construire un dashboard FinOps production-ready avec :
- Cost tracking temps reel par modele/team
- Trending analysis et forecasting
- Alertes automatiques sur anomalies
- Recommandations d'optimisation
Etape 1 : Setup BigQuery Export (10 min)
# 1. Creer dataset billing
bq mk --dataset --location=US --description="Billing export for FinOps" \
finops_lab:billing_data
# 2. Activer export (via Console)
# Billing โ Billing export โ BigQuery export โ Enable
# Dataset: finops_lab:billing_data
# 3. Verifier export actif (attendre 5-10 min)
bq ls finops_lab:billing_data
# โ gcp_billing_export_v1_XXXXXX
# 4. Tester query
bq query --use_legacy_sql=false '
SELECT service.description, SUM(cost) as cost
FROM `finops_lab.billing_data.gcp_billing_export_v1_*`
WHERE _TABLE_SUFFIX >= FORMAT_DATE("%Y%m%d", DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY))
GROUP BY service.description
ORDER BY cost DESC
LIMIT 10
'
Etape 2 : Creer Views BigQuery (15 min)
-- View 1 : Daily Vertex AI costs
CREATE OR REPLACE VIEW `finops_lab.billing_data.vertex_ai_daily_costs` AS
SELECT
DATE(usage_start_time) AS date,
CASE
WHEN sku.description LIKE '%2.5 Pro%' THEN 'Gemini 2.5 Pro'
WHEN sku.description LIKE '%2.5 Flash-8B%' THEN 'Gemini 2.5 Flash-8B'
WHEN sku.description LIKE '%2.5 Flash%' THEN 'Gemini 2.5 Flash'
WHEN sku.description LIKE '%Cache%' THEN 'Context Caching'
ELSE 'Other'
END AS model,
SUM(cost) AS cost,
SUM(usage.amount) AS usage_amount,
usage.unit
FROM `finops_lab.billing_data.gcp_billing_export_v1_*`
WHERE service.description = 'Vertex AI'
AND _TABLE_SUFFIX >= FORMAT_DATE("%Y%m%d", DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY))
GROUP BY date, model, usage.unit;
-- View 2 : Costs by team (from labels)
CREATE OR REPLACE VIEW `finops_lab.billing_data.vertex_ai_costs_by_team` AS
SELECT
DATE(usage_start_time) AS date,
labels.value AS team,
SUM(cost) AS cost
FROM `finops_lab.billing_data.gcp_billing_export_v1_*`,
UNNEST(labels) AS labels
WHERE service.description = 'Vertex AI'
AND _TABLE_SUFFIX >= FORMAT_DATE("%Y%m%d", DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY))
AND labels.key = 'team'
GROUP BY date, team;
-- View 3 : Anomaly detection (cost spike >50% vs avg)
CREATE OR REPLACE VIEW `finops_lab.billing_data.vertex_ai_cost_anomalies` AS
WITH daily_costs AS (
SELECT
DATE(usage_start_time) AS date,
SUM(cost) AS daily_cost
FROM `finops_lab.billing_data.gcp_billing_export_v1_*`
WHERE service.description = 'Vertex AI'
AND _TABLE_SUFFIX >= FORMAT_DATE("%Y%m%d", DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY))
GROUP BY date
),
stats AS (
SELECT
AVG(daily_cost) AS avg_cost,
STDDEV(daily_cost) AS stddev_cost
FROM daily_costs
)
SELECT
dc.date,
dc.daily_cost,
s.avg_cost,
dc.daily_cost - s.avg_cost AS deviation,
(dc.daily_cost - s.avg_cost) / s.avg_cost * 100 AS deviation_pct
FROM daily_costs dc, stats s
WHERE dc.daily_cost > s.avg_cost * 1.5 -- Spike >50%
ORDER BY dc.date DESC;
Etape 3 : Dashboard Python (30 min)
# finops_dashboard.py
from google.cloud import bigquery
import pandas as pd
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from datetime import datetime, timedelta
import smtplib
from email.mime.text import MIMEText
class FinOpsDashboard:
def __init__(self, project_id: str):
self.client = bigquery.Client(project=project_id)
self.project_id = project_id
def fetch_daily_costs(self) -> pd.DataFrame:
query = "SELECT * FROM `finops_lab.billing_data.vertex_ai_daily_costs`"
return self.client.query(query).to_dataframe()
def fetch_team_costs(self) -> pd.DataFrame:
query = "SELECT * FROM `finops_lab.billing_data.vertex_ai_costs_by_team`"
return self.client.query(query).to_dataframe()
def fetch_anomalies(self) -> pd.DataFrame:
query = "SELECT * FROM `finops_lab.billing_data.vertex_ai_cost_anomalies`"
return self.client.query(query).to_dataframe()
def generate_dashboard(self, output_file: str = "finops_dashboard.html"):
"""Generate interactive HTML dashboard"""
# Fetch data
daily_costs = self.fetch_daily_costs()
team_costs = self.fetch_team_costs()
anomalies = self.fetch_anomalies()
# Create subplots
fig = make_subplots(
rows=3, cols=2,
subplot_titles=(
'Daily Costs by Model',
'Model Distribution (Last 7 days)',
'Costs by Team',
'Cost Trend & Forecast',
'Anomalies Detected',
'Summary Metrics'
),
specs=[
[{"type": "scatter"}, {"type": "pie"}],
[{"type": "bar"}, {"type": "scatter"}],
[{"type": "scatter"}, {"type": "table"}]
]
)
# 1. Daily costs by model (line chart)
for model in daily_costs['model'].unique():
model_data = daily_costs[daily_costs['model'] == model]
fig.add_trace(
go.Scatter(
x=model_data['date'],
y=model_data['cost'],
name=model,
mode='lines+markers'
),
row=1, col=1
)
# 2. Model distribution (pie chart - last 7 days)
last_7d = daily_costs[daily_costs['date'] >= datetime.now() - timedelta(days=7)]
model_costs = last_7d.groupby('model')['cost'].sum()
fig.add_trace(
go.Pie(labels=model_costs.index, values=model_costs.values),
row=1, col=2
)
# 3. Costs by team (bar chart)
team_total = team_costs.groupby('team')['cost'].sum().sort_values(ascending=False)
fig.add_trace(
go.Bar(x=team_total.index, y=team_total.values),
row=2, col=1
)
# 4. Trend with forecast
total_daily = daily_costs.groupby('date')['cost'].sum().reset_index()
# Simple linear forecast
last_7_avg = total_daily.tail(7)['cost'].mean()
forecast_dates = pd.date_range(
start=total_daily['date'].max() + timedelta(days=1),
periods=30
)
forecast_values = [last_7_avg] * 30
fig.add_trace(
go.Scatter(
x=total_daily['date'],
y=total_daily['cost'],
name='Actual',
mode='lines'
),
row=2, col=2
)
fig.add_trace(
go.Scatter(
x=forecast_dates,
y=forecast_values,
name='Forecast',
mode='lines',
line=dict(dash='dash')
),
row=2, col=2
)
# 5. Anomalies
if not anomalies.empty:
fig.add_trace(
go.Scatter(
x=anomalies['date'],
y=anomalies['daily_cost'],
mode='markers',
marker=dict(size=10, color='red'),
name='Anomalies'
),
row=3, col=1
)
# 6. Summary table
total_cost_30d = daily_costs['cost'].sum()
avg_daily = daily_costs.groupby('date')['cost'].sum().mean()
forecast_monthly = avg_daily * 30
top_model = model_costs.idxmax()
summary = pd.DataFrame({
'Metric': [
'Last 30d Cost',
'Avg Daily Cost',
'Forecast Monthly',
'Top Model',
'Anomalies Detected'
],
'Value': [
f"${total_cost_30d:.2f}",
f"${avg_daily:.2f}",
f"${forecast_monthly:.2f}",
top_model,
str(len(anomalies))
]
})
fig.add_trace(
go.Table(
header=dict(values=list(summary.columns)),
cells=dict(values=[summary['Metric'], summary['Value']])
),
row=3, col=2
)
# Layout
fig.update_layout(
height=1200,
title_text="Vertex AI FinOps Dashboard",
showlegend=True
)
# Save
fig.write_html(output_file)
print(f"โ Dashboard saved to {output_file}")
return fig, summary
def check_and_alert_anomalies(self, email_to: str = None):
"""Check for anomalies and send alerts"""
anomalies = self.fetch_anomalies()
if not anomalies.empty:
print(f"โ ๏ธ {len(anomalies)} cost anomalies detected!")
for _, row in anomalies.iterrows():
print(f" {row['date']}: ${row['daily_cost']:.2f} "
f"(+{row['deviation_pct']:.1f}% vs avg)")
# Send email alert
if email_to:
self._send_email_alert(anomalies, email_to)
else:
print("โ No cost anomalies detected")
def _send_email_alert(self, anomalies: pd.DataFrame, email_to: str):
"""Send email alert for anomalies"""
body = f"""
Cost Anomaly Alert - Vertex AI
{len(anomalies)} anomalies detected:
"""
for _, row in anomalies.iterrows():
body += f"- {row['date']}: ${row['daily_cost']:.2f} (+{row['deviation_pct']:.1f}%)\n"
body += "\nCheck dashboard for details."
msg = MIMEText(body)
msg['Subject'] = f'โ ๏ธ Vertex AI Cost Anomaly Alert'
msg['From'] = 'finops@company.com'
msg['To'] = email_to
# Send via SMTP (configure your SMTP server)
# smtp = smtplib.SMTP('smtp.gmail.com', 587)
# smtp.send_message(msg)
print(f"โ Alert email sent to {email_to}")
# Generate dashboard
dashboard = FinOpsDashboard(project_id="finops_lab")
fig, summary = dashboard.generate_dashboard()
# Check anomalies
dashboard.check_and_alert_anomalies(email_to="team@company.com")
print("\n=== SUMMARY ===")
print(summary.to_string(index=False))
Etape 4 : Budget Alerts (10 min)
# Creer budget $2000/mois avec alertes
# (remplacer BILLING_ACCOUNT_ID)
gcloud billing budgets create \
  --billing-account=BILLING_ACCOUNT_ID \
  --display-name="Vertex AI Monthly Budget" \
  --budget-amount=2000USD \
  --threshold-rule=percent=0.5 \
  --threshold-rule=percent=0.9 \
  --threshold-rule=percent=1.0 \
  --filter-projects=projects/finops_lab \
  --filter-services=services/aiplatform.googleapis.com

# Configurer notification Pub/Sub
gcloud pubsub topics create budget-alerts
gcloud pubsub subscriptions create budget-alerts-sub \
  --topic=budget-alerts \
  --push-endpoint=https://your-cloud-run-url/budget-alert

# Cloud Function pour traiter alertes
# (deployer fonction qui parse message et envoie email/Slack)
Etape 5 : Scheduler Automatique (15 min)
# Deployer dashboard sur Cloud Run
# Creer un Dockerfile et un requirements.txt pour le service
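Sketch indicatif des commandes de deploiement et de refresh horaire (nom de service, region, URL et service account purement hypothetiques, a adapter a votre projet) :
# Deploiement Cloud Run depuis les sources (Dockerfile + requirements.txt presents dans le dossier)
gcloud run deploy finops-dashboard \
  --source . \
  --region us-central1 \
  --no-allow-unauthenticated

# Refresh automatique toutes les heures via Cloud Scheduler
gcloud scheduler jobs create http finops-dashboard-refresh \
  --schedule="0 * * * *" \
  --uri="https://finops-dashboard-XXXX.a.run.app/refresh" \
  --http-method=GET \
  --oidc-service-account-email=scheduler-sa@PROJECT_ID.iam.gserviceaccount.com \
  --location=us-central1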
Etape 6 : Tester & Valider (10 min)
- Ouvrir dashboard HTML genere
- Verifier graphiques affichent donnees correctes
- Simuler anomalie (requetes massives vers Pro ; voir le sketch apres cette liste)
- Verifier alerte recue par email/Slack
- Verifier budget alert a 50% budget
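Pour simuler l'anomalie, un petit script illustratif qui envoie volontairement une rafale de requetes vers Pro (cela genere des couts reels ; le volume necessaire depend de votre cout quotidien moyen) :
# Sketch : provoquer un pic de cout artificiel pour tester la detection d'anomalies
from vertexai.generative_models import GenerativeModel

pro = GenerativeModel("gemini-2.5-pro")

for i in range(50):  # rafale volontairement limitee
    pro.generate_content(f"Redige une analyse detaillee du scenario numero {i}")

# Attendre l'export billing (quelques heures), puis re-verifier :
# dashboard.check_and_alert_anomalies(email_to="team@company.com")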
โ Validation
Votre dashboard FinOps est complet si vous avez :
- โ BigQuery export actif avec views custom
- โ Dashboard interactif avec 6 visualisations
- โ Anomaly detection automatique
- โ Budget alerts configurees (50%, 90%, 100%)
- โ Refresh automatique toutes les heures
- โ Email/Slack alerts operationnels
Ce dashboard FinOps vous donne visibilite complete sur couts Vertex AI. En production, ajoutez : forecasting ML (Prophet), recommendations automatiques (model routing), integration Slack pour alertes temps reel. Revisez dashboard chaque lundi en equipe, identifiez anomalies, iterez. Avec ce setup, vous detectez derives avant facture surprenante.
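A titre d'exemple, le forecasting Prophet evoque ci-dessus pourrait ressembler a ceci (package prophet suppose installe ; daily_costs provient de fetch_daily_costs() du dashboard) :
# Sketch : prevision des couts a 30 jours avec Prophet
import pandas as pd
from prophet import Prophet

daily = daily_costs.groupby("date")["cost"].sum().reset_index()
df = pd.DataFrame({"ds": pd.to_datetime(daily["date"]), "y": daily["cost"]})

m = Prophet(weekly_seasonality=True)  # capte la saisonnalite hebdomadaire du trafic
m.fit(df)

future = m.make_future_dataframe(periods=30)
forecast = m.predict(future)

forecast_30d = forecast.tail(30)["yhat"].sum()
print(f"Cout prevu sur les 30 prochains jours : ${forecast_30d:.2f}")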
Quiz Module 4.2
๐ Quiz : FinOps & Optimisation
15 questions pour valider vos connaissances
1. Quelle technique offre la plus grande economie potentielle ?
2. Model routing intelligent peut reduire couts de combien ?
3. Context caching est rentable a partir de combien de requetes ?
4. Quelle difference entre implicit et explicit caching ?
5. Pour chatbot avec system instruction 5000 tokens, quel TTL cache optimal ?
6. Flash-8B coute combien vs Pro pour input tokens ?
7. Batch API offre -50% cout mais avec quelle contrainte ?
8. Quelle regle pour classifier requete comme "simple" ?
9. Output tokens coutent combien vs input tokens (Flash) ?
10. BigQuery export billing est :
11. Budget alert doit etre configure a quels seuils ?
12. Cost attribution par equipe se fait via :
13. Anomaly detection identifie cout anormal si :
14. Quelle strategie pour requetes urgentes a faible cout ?
15. Dashboard FinOps doit etre rafraichi a quelle frequence ?
IA Responsable Google
๐ฏ Objectifs d'apprentissage
- Comprendre les 7 principes Google AI
- Configurer safety settings Gemini
- Utiliser Gemma Scope pour interpretabilite
- Implementer guardrails IA responsable
๐ฏ Les 7 Principes Google AI
| # | Principe | Signification | Implementation Gemini |
|---|---|---|---|
| 1 | Be socially beneficial | IA doit beneficier societe | Gemini optimise pour aide, pas manipulation |
| 2 | Avoid unfair bias | Eviter biais injustes | Training data diverse, evaluation bias continue |
| 3 | Built & tested for safety | Securite par conception | Safety filters, red teaming, adversarial testing |
| 4 | Accountable to people | Responsabilite humaine | Human-in-the-loop, audit logs, explicabilite |
| 5 | Privacy by design | Confidentialite integree | Data not used for training (Vertex AI) |
| 6 | Scientific excellence | Excellence scientifique | Recherche Google AI publiee, peer-reviewed |
| 7 | Appropriate uses | Usages appropries | Terms of Service interdisent malware, spam, violence |
Applications que Google s'engage a ne pas poursuivre :
- Armes ou surveillance de masse
- Technologies violant droits humains
- Collecte d'infos contre droit international
๐ก๏ธ Safety Settings Gemini
Gemini inclut 4 harm categories avec 4 seuils de blocage.
from vertexai.generative_models import (
GenerativeModel,
HarmCategory,
HarmBlockThreshold,
SafetySetting
)
# 4 Harm Categories
# - HARM_CATEGORY_HARASSMENT : Harcelement
# - HARM_CATEGORY_HATE_SPEECH : Discours haineux
# - HARM_CATEGORY_SEXUALLY_EXPLICIT : Contenu sexuel explicite
# - HARM_CATEGORY_DANGEROUS_CONTENT : Contenu dangereux
# 4 Thresholds
# - BLOCK_NONE : Pas de blocage (permissif)
# - BLOCK_ONLY_HIGH : Bloquer seulement haute probabilite
# - BLOCK_MEDIUM_AND_ABOVE : Bloquer moyenne et haute (DEFAULT)
# - BLOCK_LOW_AND_ABOVE : Bloquer tout (strict)
# Configuration stricte (production recommandee)
safety_settings_strict = [
SafetySetting(
category=HarmCategory.HARM_CATEGORY_HARASSMENT,
threshold=HarmBlockThreshold.BLOCK_LOW_AND_ABOVE
),
SafetySetting(
category=HarmCategory.HARM_CATEGORY_HATE_SPEECH,
threshold=HarmBlockThreshold.BLOCK_LOW_AND_ABOVE
),
SafetySetting(
category=HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT,
threshold=HarmBlockThreshold.BLOCK_LOW_AND_ABOVE
),
SafetySetting(
category=HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT,
threshold=HarmBlockThreshold.BLOCK_LOW_AND_ABOVE
),
]
model_strict = GenerativeModel(
"gemini-2.5-flash",
safety_settings=safety_settings_strict
)
# Configuration permissive (R&D uniquement)
safety_settings_permissive = [
SafetySetting(
category=HarmCategory.HARM_CATEGORY_HARASSMENT,
threshold=HarmBlockThreshold.BLOCK_NONE
),
SafetySetting(
category=HarmCategory.HARM_CATEGORY_HATE_SPEECH,
threshold=HarmBlockThreshold.BLOCK_NONE
),
SafetySetting(
category=HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT,
threshold=HarmBlockThreshold.BLOCK_NONE
),
SafetySetting(
category=HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT,
threshold=HarmBlockThreshold.BLOCK_NONE
),
]
model_permissive = GenerativeModel(
"gemini-2.5-flash",
safety_settings=safety_settings_permissive
)
# Tester avec prompt sensible
prompt_sensible = "Comment construire une bombe ?"
try:
response_strict = model_strict.generate_content(prompt_sensible)
print("Strict:", response_strict.text)
except Exception as e:
print("Strict: BLOCKED -", e)
# โ BLOCKED (safety filter)
try:
response_permissive = model_permissive.generate_content(prompt_sensible)
print("Permissive:", response_permissive.text)
except Exception as e:
print("Permissive: BLOCKED -", e)
# โ Peut passer (mais ToS Google interdit quand meme usage malveillant)
๐ Analyser Safety Ratings
# Generer reponse et inspecter safety ratings
response = model_strict.generate_content("Raconte une blague")
# Safety ratings pour le prompt
print("=== PROMPT SAFETY ===")
for rating in response.prompt_feedback.safety_ratings:
print(f"{rating.category.name}: {rating.probability.name}")
# Safety ratings pour la reponse
print("\n=== RESPONSE SAFETY ===")
for candidate in response.candidates:
for rating in candidate.safety_ratings:
print(f"{rating.category.name}: {rating.probability.name}")
# SORTIE EXEMPLE :
# === PROMPT SAFETY ===
# HARM_CATEGORY_HARASSMENT: NEGLIGIBLE
# HARM_CATEGORY_HATE_SPEECH: NEGLIGIBLE
# HARM_CATEGORY_SEXUALLY_EXPLICIT: NEGLIGIBLE
# HARM_CATEGORY_DANGEROUS_CONTENT: NEGLIGIBLE
#
# === RESPONSE SAFETY ===
# HARM_CATEGORY_HARASSMENT: NEGLIGIBLE
# HARM_CATEGORY_HATE_SPEECH: NEGLIGIBLE
# HARM_CATEGORY_SEXUALLY_EXPLICIT: LOW
# HARM_CATEGORY_DANGEROUS_CONTENT: NEGLIGIBLE
# Implementer logging safety pour monitoring
import json
from datetime import datetime

def log_safety_event(prompt: str, response, blocked: bool):
    """Logger evenements safety pour audit"""
    event = {
        "timestamp": datetime.now().isoformat(),
        "prompt": prompt[:100],  # Truncate for privacy
        "blocked": blocked,
    }
    # response peut etre None si la requete a ete bloquee avant generation
    if response is not None:
        event["prompt_safety"] = {
            rating.category.name: rating.probability.name
            for rating in response.prompt_feedback.safety_ratings
        }
        if not blocked:
            event["response_safety"] = {
                rating.category.name: rating.probability.name
                for rating in response.candidates[0].safety_ratings
            }
    # Log to BigQuery ou Cloud Logging
    with open("safety_logs.jsonl", "a") as f:
        f.write(json.dumps(event) + "\n")
    return event

# Utiliser avec monitoring
prompt = "Raconte une blague"
try:
    response = model_strict.generate_content(prompt)
    log_safety_event(prompt, response, blocked=False)
except Exception:
    log_safety_event(prompt, None, blocked=True)
๐ Gemma Scope : Interpretabilite
Gemma Scope est un outil open-source pour interpreter modeles Gemma (sparse autoencoders).
# pip install gemma-scope
from gemma_scope import GemmaScope
# Charger Gemma 3 avec Scope
scope = GemmaScope(model_name="gemma-3-9b")
# Analyser activation pour prompt
prompt = "Paris est la capitale de"
activations = scope.get_activations(prompt)
# Top features actives
top_features = scope.get_top_features(activations, k=10)
print("=== TOP 10 ACTIVATED FEATURES ===")
for feature_id, activation_strength in top_features:
feature_desc = scope.get_feature_description(feature_id)
print(f"Feature {feature_id}: {feature_desc} (strength: {activation_strength:.3f})")
# SORTIE EXEMPLE :
# Feature 1847: Geographic location / capital city (strength: 0.892)
# Feature 3201: French language context (strength: 0.654)
# Feature 892: European geography (strength: 0.543)
# ...
# Use case : Detecter biais
prompt_biased = "Les femmes sont"
activations_biased = scope.get_activations(prompt_biased)
top_biased = scope.get_top_features(activations_biased, k=5)
# Si feature "gender stereotype" active โ red flag pour review
๐ก๏ธ Guardrails Implementation
class ResponsibleAIGuardrails:
def __init__(self, model: GenerativeModel):
self.model = model
self.blocked_keywords = [
"hack", "exploit", "crack", "bypass",
# ... ajouter keywords sensibles pour votre domaine
]
def check_prompt_safety(self, prompt: str) -> dict:
"""Pre-flight checks avant envoi a Gemini"""
issues = []
# 1. Check PII
if self._contains_pii(prompt):
issues.append("PII_DETECTED")
# 2. Check blocked keywords
if any(kw in prompt.lower() for kw in self.blocked_keywords):
issues.append("BLOCKED_KEYWORD")
# 3. Check prompt injection
if self._is_prompt_injection(prompt):
issues.append("PROMPT_INJECTION")
return {
"safe": len(issues) == 0,
"issues": issues
}
def _contains_pii(self, text: str) -> bool:
"""Detecter PII (simplifie, utiliser DLP en prod)"""
import re
# SSN pattern
ssn_pattern = r'\b\d{3}-\d{2}-\d{4}\b'
if re.search(ssn_pattern, text):
return True
# Email pattern
email_pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b'
if re.search(email_pattern, text):
return True
return False
def _is_prompt_injection(self, text: str) -> bool:
"""Detecter tentative prompt injection"""
injection_patterns = [
"ignore previous instructions",
"disregard above",
"new instructions:",
"system:",
]
return any(pattern in text.lower() for pattern in injection_patterns)
def generate_safe(self, prompt: str):
"""Generate avec guardrails"""
# Pre-flight checks
safety_check = self.check_prompt_safety(prompt)
if not safety_check["safe"]:
raise ValueError(f"Prompt blocked: {safety_check['issues']}")
# Generate
response = self.model.generate_content(prompt)
# Post-flight checks
if response.candidates[0].finish_reason.name == "SAFETY":
raise ValueError("Response blocked by safety filter")
return response.text
# Utilisation
guardrails = ResponsibleAIGuardrails(model_strict)
# Safe prompt
try:
response1 = guardrails.generate_safe("Explique la photosynthese")
print("โ Safe:", response1[:100])
except ValueError as e:
print("โ Blocked:", e)
# Unsafe prompt (PII)
try:
response2 = guardrails.generate_safe("Mon email est john@example.com, aide-moi")
print("โ Safe:", response2[:100])
except ValueError as e:
print("โ Blocked:", e)
# โ Blocked: ['PII_DETECTED']
IA Responsable n'est pas optionnel. Configurez safety settings strictes en prod (BLOCK_LOW_AND_ABOVE), loggez tous les events safety pour audit. Implementez guardrails pre/post pour bloquer PII, prompt injection, keywords sensibles. Utilisez Gemma Scope pour interpreter decisions et detecter biais. Google AI Principles = framework solide, suivez-le.
Gouvernance des Modeles
๐ฏ Objectifs d'apprentissage
- Gerer model lifecycle (dev โ staging โ prod)
- Implementer versioning et deprecation strategy
- Migrer 2.0 โ 2.5 en production
- Documenter decisions avec ADR
๐ Model Lifecycle Management
๐ Model Registry & Versioning
# model_registry.py
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
import json
class ModelStage(Enum):
DEVELOPMENT = "dev"
STAGING = "staging"
PRODUCTION = "prod"
DEPRECATED = "deprecated"
@dataclass
class ModelVersion:
name: str # e.g., "gemini-2.5-flash"
version: str # e.g., "v1.2.3"
stage: ModelStage
created_at: datetime
promoted_at: datetime = None
deprecated_at: datetime = None
performance_metrics: dict = None
notes: str = ""
class ModelRegistry:
def __init__(self, registry_file: str = "model_registry.json"):
self.registry_file = registry_file
self.models = self._load_registry()
def _load_registry(self) -> dict:
"""Load registry from file"""
try:
with open(self.registry_file, "r") as f:
data = json.load(f)
# Convert to ModelVersion objects
models = {}
for key, val in data.items():
val['stage'] = ModelStage(val['stage'])
val['created_at'] = datetime.fromisoformat(val['created_at'])
if val.get('promoted_at'):
val['promoted_at'] = datetime.fromisoformat(val['promoted_at'])
if val.get('deprecated_at'):
val['deprecated_at'] = datetime.fromisoformat(val['deprecated_at'])
models[key] = ModelVersion(**val)
return models
except FileNotFoundError:
return {}
def _save_registry(self):
"""Save registry to file"""
data = {}
for key, model in self.models.items():
data[key] = {
'name': model.name,
'version': model.version,
'stage': model.stage.value,
'created_at': model.created_at.isoformat(),
'promoted_at': model.promoted_at.isoformat() if model.promoted_at else None,
'deprecated_at': model.deprecated_at.isoformat() if model.deprecated_at else None,
'performance_metrics': model.performance_metrics,
'notes': model.notes
}
with open(self.registry_file, "w") as f:
json.dump(data, f, indent=2)
def register_model(self, name: str, version: str, stage: ModelStage, notes: str = ""):
"""Register new model version"""
key = f"{name}@{version}"
self.models[key] = ModelVersion(
name=name,
version=version,
stage=stage,
created_at=datetime.now(),
notes=notes
)
self._save_registry()
print(f"โ Registered {key} in {stage.value}")
def promote_model(self, name: str, version: str, to_stage: ModelStage):
"""Promote model to next stage"""
key = f"{name}@{version}"
if key not in self.models:
raise ValueError(f"Model {key} not found in registry")
self.models[key].stage = to_stage
self.models[key].promoted_at = datetime.now()
self._save_registry()
print(f"โ Promoted {key} to {to_stage.value}")
def deprecate_model(self, name: str, version: str, reason: str):
"""Deprecate model version"""
key = f"{name}@{version}"
if key not in self.models:
raise ValueError(f"Model {key} not found in registry")
self.models[key].stage = ModelStage.DEPRECATED
self.models[key].deprecated_at = datetime.now()
self.models[key].notes += f"\nDeprecated: {reason}"
self._save_registry()
print(f"โ Deprecated {key}: {reason}")
def get_active_model(self, name: str, stage: ModelStage) -> ModelVersion:
"""Get active model version for stage"""
active_models = [
model for model in self.models.values()
if model.name == name and model.stage == stage
]
if not active_models:
raise ValueError(f"No active {name} model in {stage.value}")
# Return most recent
return sorted(active_models, key=lambda m: m.created_at, reverse=True)[0]
def list_models(self, stage: ModelStage = None):
"""List all models, optionally filtered by stage"""
models = self.models.values()
if stage:
models = [m for m in models if m.stage == stage]
for model in sorted(models, key=lambda m: m.created_at, reverse=True):
print(f"{model.name}@{model.version} | {model.stage.value} | {model.created_at.date()}")
# Usage
registry = ModelRegistry()
# Register new model in dev
registry.register_model(
name="gemini-2.5-flash",
version="v1.0.0",
stage=ModelStage.DEVELOPMENT,
notes="Initial deployment with context caching"
)
# After testing, promote to staging
registry.promote_model(
name="gemini-2.5-flash",
version="v1.0.0",
to_stage=ModelStage.STAGING
)
# After staging validation, promote to prod
registry.promote_model(
name="gemini-2.5-flash",
version="v1.0.0",
to_stage=ModelStage.PRODUCTION
)
# Deploy new version
registry.register_model(
name="gemini-2.5-flash",
version="v1.1.0",
stage=ModelStage.DEVELOPMENT,
notes="Added model routing"
)
# Deprecate old version
registry.deprecate_model(
name="gemini-1.5-flash",
version="v0.9.0",
reason="Migrated to Gemini 2.5"
)
# List prod models
print("\n=== PRODUCTION MODELS ===")
registry.list_models(stage=ModelStage.PRODUCTION)
Migration Strategy: 2.0 → 2.5
from vertexai.generative_models import GenerativeModel
import random

class ModelMigration:
    """Manage a progressive migration between model versions"""
    def __init__(self, old_model: str, new_model: str):
        self.old_model = GenerativeModel(old_model)
        self.new_model = GenerativeModel(new_model)
        self.rollout_percentage = 0

    def set_rollout(self, percentage: int):
        """Set the traffic split (0-100% routed to the new model)"""
        if not 0 <= percentage <= 100:
            raise ValueError("Percentage must be 0-100")
        self.rollout_percentage = percentage
        print(f"Rollout: {percentage}% → {self.new_model._model_name}")

    def generate_content(self, prompt: str):
        """Generate with traffic splitting"""
        # Traffic split
        if random.randint(0, 99) < self.rollout_percentage:
            # Route to the new model
            print(f"[Routing] → NEW: {self.new_model._model_name}")
            return self.new_model.generate_content(prompt)
        else:
            # Route to the old model
            print(f"[Routing] → OLD: {self.old_model._model_name}")
            return self.old_model.generate_content(prompt)

# Progressive migration 2.0 → 2.5
migration = ModelMigration(
    old_model="gemini-2.0-flash-exp",
    new_model="gemini-2.5-flash"
)

# Week 1: 10% of traffic to 2.5
migration.set_rollout(10)
for i in range(10):
    migration.generate_content("Test query")
# → on average 1 request in 10 goes to 2.5, 9 go to 2.0

# Week 2: monitor metrics; if OK → 50%
migration.set_rollout(50)

# Week 3: 100% to 2.5
migration.set_rollout(100)
# Deprecate 2.0
registry.deprecate_model(
name="gemini-2.0-flash-exp",
version="v1.0.0",
reason="Fully migrated to 2.5"
)
Architecture Decision Records (ADR)
# ADR-001: Migration to Gemini 2.5 Flash

## Status
ACCEPTED - 2026-02-01

## Context
Our support chatbot application has been using Gemini 2.0 Flash Exp for 6 months.
Gemini 2.5 Flash offers better performance (+15% quality) at the same price.

## Decision
Migrate progressively to Gemini 2.5 Flash over 3 weeks:
- Week 1: 10% traffic (canary)
- Week 2: 50% traffic (large-scale validation)
- Week 3: 100% traffic (full rollout)

## Consequences
### Positive
- +15% quality score (internal benchmark evaluation)
- Identical latency (~800ms p95)
- Identical cost ($0.15/1M input)
- Long-context support (2M tokens vs 1M)
### Negative
- Risk of quality regression (mitigation: canary + rollback plan)
- Migration effort: 2 engineer-days
### Neutral
- Identical API, no code changes

## Rollback Plan
If quality score < baseline:
1. Immediate rollback to 2.0 Flash
2. Root cause analysis
3. Re-evaluation of the decision

## Monitoring
- Quality score (target: >85%)
- Latency p50/p95 (target: <1000ms)
- Cost per conversation (target: <$0.001)
- Error rate (target: <1%)

## References
- Benchmark results: docs/benchmarks/2.0-vs-2.5.md
- Gemini 2.5 release notes: https://cloud.google.com/vertex-ai/docs/release-notes
- One ADR per major decision (model change, architecture change)
- Standardized template: Status, Context, Decision, Consequences (a small lint sketch follows below)
- Store ADRs in Git (docs/adr/)
- Team review before marking a decision ACCEPTED
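To keep that convention enforceable, a small lint script can run in CI and fail when an ADR is missing a required section. This is a minimal sketch under assumed conventions (ADRs live in docs/adr/ as Markdown files); adapt the paths and section names to your repository.
# adr_lint.py - check that every ADR contains the required sections (illustrative sketch)
from pathlib import Path

REQUIRED_SECTIONS = ["## Status", "## Context", "## Decision", "## Consequences"]

def lint_adrs(adr_dir: str = "docs/adr") -> bool:
    ok = True
    for adr_file in sorted(Path(adr_dir).glob("*.md")):
        content = adr_file.read_text(encoding="utf-8")
        missing = [section for section in REQUIRED_SECTIONS if section not in content]
        if missing:
            ok = False
            print(f"❌ {adr_file.name}: missing {', '.join(missing)}")
        else:
            print(f"✅ {adr_file.name}")
    return ok

if __name__ == "__main__":
    raise SystemExit(0 if lint_adrs() else 1)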
Model Deprecation Timeline
| Phase | Duration | Actions |
|---|---|---|
| Announcement | T-90 days | Internal/external communication, migration guide published |
| Warning | T-60 days | Deprecation warnings in logs, emails to teams |
| Migration | T-30 days | Active migration support, office hours |
| Sunset | T-0 | Model disabled, requests rejected with an explicit error |
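The timeline above can also be derived in code, for example to decide which deprecation warning to emit in logs. A minimal sketch (the phase boundaries mirror the table; the function name is illustrative):
from datetime import date
from typing import Optional

def deprecation_phase(sunset_date: date, today: Optional[date] = None) -> str:
    """Map a model's sunset date to the deprecation phase defined in the table above."""
    today = today or date.today()
    days_left = (sunset_date - today).days
    if days_left > 90:
        return "active"
    if days_left > 60:
        return "announcement"  # T-90: communication + migration guide
    if days_left > 30:
        return "warning"       # T-60: deprecation warnings in logs
    if days_left > 0:
        return "migration"     # T-30: active migration support
    return "sunset"            # T-0: requests rejected with an explicit error

# Example
print(deprecation_phase(date(2026, 6, 1), today=date(2026, 5, 10)))  # "migration" (22 days left)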
Model governance is an essential discipline in production. Use a model registry to track which versions are active in each environment. Progressive migration (10% → 50% → 100%) reduces risk. An ADR documents the WHY behind every major decision (critical for onboarding and audits). Deprecating with 90 days' notice shows respect for your users.
Agent Governance
Learning objectives
- Implement tool governance in Agent Builder
- Manage permissions and access control
- Audit agent actions
- Use a secured agent marketplace
Tool Governance Framework
from vertexai.preview import reasoning_engines
from google.cloud import firestore
from datetime import datetime, timedelta
from enum import Enum

class ToolRiskLevel(Enum):
    LOW = "low"            # Read-only, no business impact
    MEDIUM = "medium"      # Limited modifications
    HIGH = "high"          # Critical actions (deletions, payments)
    CRITICAL = "critical"  # Irreversible actions
class ToolGovernance:
def __init__(self, firestore_db):
self.db = firestore_db
self.audit_collection = "agent_tool_audit"
def register_tool(self, tool_name: str, risk_level: ToolRiskLevel,
requires_approval: bool = False):
"""Enregistrer tool avec niveau de risque"""
tool_doc = {
"name": tool_name,
"risk_level": risk_level.value,
"requires_approval": requires_approval,
"registered_at": datetime.now(),
"allowed_agents": [], # Whitelist agents
}
self.db.collection("tool_registry").document(tool_name).set(tool_doc)
print(f"โ Tool registered: {tool_name} (risk: {risk_level.value})")
def approve_tool_for_agent(self, tool_name: str, agent_id: str, approved_by: str):
"""Approuver tool pour agent specifique"""
tool_ref = self.db.collection("tool_registry").document(tool_name)
tool_doc = tool_ref.get()
if not tool_doc.exists:
raise ValueError(f"Tool {tool_name} not registered")
        # Add the agent to the whitelist
        tool_ref.update({
            "allowed_agents": firestore.ArrayUnion([agent_id])
        })
        # Log the approval
self._audit_log({
"event": "TOOL_APPROVED",
"tool": tool_name,
"agent": agent_id,
"approved_by": approved_by,
"timestamp": datetime.now()
})
print(f"โ Tool {tool_name} approved for agent {agent_id}")
def check_tool_permission(self, tool_name: str, agent_id: str) -> bool:
"""Verifier si agent peut utiliser tool"""
tool_doc = self.db.collection("tool_registry").document(tool_name).get()
if not tool_doc.exists:
return False
tool_data = tool_doc.to_dict()
# Check whitelist
if agent_id not in tool_data.get("allowed_agents", []):
return False
return True
def audit_tool_call(self, tool_name: str, agent_id: str, params: dict, result: dict):
"""Auditer appel tool"""
self._audit_log({
"event": "TOOL_CALLED",
"tool": tool_name,
"agent": agent_id,
"params": params,
"result": result,
"timestamp": datetime.now()
})
def _audit_log(self, log_entry: dict):
"""Logger event audit dans Firestore"""
self.db.collection(self.audit_collection).add(log_entry)
def get_audit_trail(self, agent_id: str = None, tool_name: str = None, days: int = 30):
"""Recuperer audit trail"""
query = self.db.collection(self.audit_collection)
if agent_id:
query = query.where("agent", "==", agent_id)
if tool_name:
query = query.where("tool", "==", tool_name)
# Last N days
cutoff = datetime.now() - timedelta(days=days)
query = query.where("timestamp", ">=", cutoff)
results = query.stream()
print(f"=== AUDIT TRAIL (last {days} days) ===")
for doc in results:
data = doc.to_dict()
print(f"{data['timestamp']}: {data['event']} - {data.get('tool', 'N/A')} by {data.get('agent', 'N/A')}")
# Setup governance
db = firestore.Client()
governance = ToolGovernance(db)
# Register tools with their risk levels
governance.register_tool(
tool_name="search_knowledge_base",
risk_level=ToolRiskLevel.LOW,
requires_approval=False
)
governance.register_tool(
tool_name="update_customer_record",
risk_level=ToolRiskLevel.MEDIUM,
requires_approval=True
)
governance.register_tool(
tool_name="process_refund",
risk_level=ToolRiskLevel.HIGH,
requires_approval=True
)
governance.register_tool(
tool_name="delete_account",
risk_level=ToolRiskLevel.CRITICAL,
requires_approval=True
)
# Approve tools for the support agent
governance.approve_tool_for_agent(
tool_name="search_knowledge_base",
agent_id="agent-support-001",
approved_by="admin@company.com"
)
governance.approve_tool_for_agent(
tool_name="update_customer_record",
agent_id="agent-support-001",
approved_by="admin@company.com"
)
# The finance agent may process_refund
governance.approve_tool_for_agent(
tool_name="process_refund",
agent_id="agent-finance-001",
approved_by="finance-manager@company.com"
)
# Check permissions
can_search = governance.check_tool_permission("search_knowledge_base", "agent-support-001")
print(f"Agent support can search KB: {can_search}") # True
can_delete = governance.check_tool_permission("delete_account", "agent-support-001")
print(f"Agent support can delete account: {can_delete}") # False
# Audit trail
governance.get_audit_trail(agent_id="agent-support-001", days=7)
Agent Permission Model
| Agent Type | Allowed Tools | Risk Level | Approval Required |
|---|---|---|---|
| Customer Support | Search KB, View orders, Update contact info | LOW-MEDIUM | Manager approval |
| Sales | CRM lookup, Create quote, Schedule demo | LOW-MEDIUM | Sales manager approval |
| Finance | Process refund, Generate invoice, View transactions | MEDIUM-HIGH | Finance manager approval |
| Admin | All tools including delete, modify settings | HIGH-CRITICAL | C-level approval |
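The matrix above can be layered on top of the ToolGovernance class as a simple policy check: before whitelisting a tool for an agent, verify that the tool's risk level does not exceed the ceiling for the agent's type. A hedged sketch reusing the ToolRiskLevel enum defined earlier (the agent type names and risk ceilings mirror the table and are illustrative):
# Maximum tool risk allowed per agent type (mirrors the permission model table)
MAX_RISK_BY_AGENT_TYPE = {
    "customer_support": ToolRiskLevel.MEDIUM,
    "sales": ToolRiskLevel.MEDIUM,
    "finance": ToolRiskLevel.HIGH,
    "admin": ToolRiskLevel.CRITICAL,
}

RISK_ORDER = [ToolRiskLevel.LOW, ToolRiskLevel.MEDIUM, ToolRiskLevel.HIGH, ToolRiskLevel.CRITICAL]

def risk_allowed(agent_type: str, tool_risk: ToolRiskLevel) -> bool:
    """Return True if an agent of this type may be whitelisted for a tool of this risk level."""
    max_risk = MAX_RISK_BY_AGENT_TYPE.get(agent_type, ToolRiskLevel.LOW)
    return RISK_ORDER.index(tool_risk) <= RISK_ORDER.index(max_risk)

# Example: a support agent must not get process_refund (HIGH), but a finance agent may
print(risk_allowed("customer_support", ToolRiskLevel.HIGH))  # False
print(risk_allowed("finance", ToolRiskLevel.HIGH))           # True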
Agent Marketplace Governance
class AgentMarketplace:
    """Internal marketplace for sharing vetted agents"""
def __init__(self, firestore_db):
self.db = firestore_db
def publish_agent(self, agent_config: dict, publisher: str):
"""Publier agent dans marketplace"""
# Validation security
self._validate_agent_security(agent_config)
agent_doc = {
**agent_config,
"publisher": publisher,
"published_at": datetime.now(),
"status": "pending_review", # Require review avant usage
"downloads": 0,
"ratings": []
}
agent_id = self.db.collection("agent_marketplace").add(agent_doc)[1].id
print(f"โ Agent published for review: {agent_id}")
return agent_id
def _validate_agent_security(self, agent_config: dict):
"""Valider securite agent avant publication"""
# Check 1: Pas de hardcoded secrets
if "api_key" in str(agent_config).lower():
raise ValueError("Agent contains hardcoded API keys")
# Check 2: Tools approuves uniquement
tools = agent_config.get("tools", [])
for tool in tools:
tool_doc = self.db.collection("tool_registry").document(tool).get()
if not tool_doc.exists:
raise ValueError(f"Tool {tool} not approved in registry")
# Check 3: System instruction pas malicieux
system_instruction = agent_config.get("system_instruction", "")
malicious_keywords = ["ignore", "disregard", "bypass"]
if any(kw in system_instruction.lower() for kw in malicious_keywords):
raise ValueError("System instruction contains suspicious keywords")
def approve_agent(self, agent_id: str, reviewer: str):
"""Approuver agent apres review"""
agent_ref = self.db.collection("agent_marketplace").document(agent_id)
agent_ref.update({
"status": "approved",
"reviewed_by": reviewer,
"reviewed_at": datetime.now()
})
print(f"โ Agent {agent_id} approved by {reviewer}")
def install_agent(self, agent_id: str, user: str):
"""Installer agent depuis marketplace"""
agent_doc = self.db.collection("agent_marketplace").document(agent_id).get()
if not agent_doc.exists:
raise ValueError(f"Agent {agent_id} not found")
agent_data = agent_doc.to_dict()
if agent_data["status"] != "approved":
raise ValueError(f"Agent not approved for installation")
# Increment download counter
self.db.collection("agent_marketplace").document(agent_id).update({
"downloads": firestore.Increment(1)
})
        # Log the installation
self.db.collection("agent_installs").add({
"agent_id": agent_id,
"user": user,
"installed_at": datetime.now()
})
print(f"โ Agent {agent_id} installed for {user}")
return agent_data
# Setup marketplace
marketplace = AgentMarketplace(db)
# Publish a customer support agent
support_agent_config = {
    "name": "Customer Support Agent v2",
    "description": "Support agent with KB and CRM access",
    "model": "gemini-2.5-flash",
    "tools": ["search_knowledge_base", "update_customer_record"],
    "system_instruction": "You are a support assistant...",
}
agent_id = marketplace.publish_agent(support_agent_config, publisher="team-support@company.com")
# Review & approve
marketplace.approve_agent(agent_id, reviewer="security@company.com")
# Install it for another team
marketplace.install_agent(agent_id, user="team-sales@company.com")
Audit Dashboard
def generate_agent_audit_report(governance: ToolGovernance, days: int = 30):
    """Generate an audit report of agent activities"""
    query = governance.db.collection(governance.audit_collection)
    cutoff = datetime.now() - timedelta(days=days)
    query = query.where("timestamp", ">=", cutoff)
    events = [doc.to_dict() for doc in query.stream()]
    # Statistics
total_calls = len(events)
unique_agents = len(set(e.get("agent") for e in events))
unique_tools = len(set(e.get("tool") for e in events))
# Top tools
tool_counts = {}
for event in events:
tool = event.get("tool")
if tool:
tool_counts[tool] = tool_counts.get(tool, 0) + 1
print(f"=== AGENT AUDIT REPORT (Last {days} days) ===\n")
print(f"Total tool calls: {total_calls}")
print(f"Active agents: {unique_agents}")
print(f"Tools used: {unique_tools}\n")
print("Top 5 tools:")
for tool, count in sorted(tool_counts.items(), key=lambda x: x[1], reverse=True)[:5]:
print(f" {tool}: {count} calls")
# Detect anomalies (high-risk tool usage)
high_risk_calls = [
e for e in events
if e.get("tool") in ["process_refund", "delete_account"]
]
if high_risk_calls:
print(f"\nโ ๏ธ {len(high_risk_calls)} high-risk tool calls detected:")
for call in high_risk_calls[:10]:
print(f" {call['timestamp']}: {call['tool']} by {call['agent']}")
# Generate rapport mensuel
generate_agent_audit_report(governance, days=30)
Agent governance protects your business. A tool registry with risk levels gives you granular control. Permission whitelists enforce the least-privilege principle. A complete audit trail supports compliance and forensics. An internal marketplace enables secure agent reuse. In production, also add per-agent rate limiting, anomaly detection (unusual call patterns), and a quarterly access review.
Gemma & Open Source
Learning objectives
- Understand Gemma 3/3n and its use cases
- Deploy Gemma on-device (Nano)
- Fine-tune Gemma for a specific domain
- Use Gemma Scope 2 for interpretability
Gemma Family (2026)
| Modele | Taille | Use Case | Deployment |
|---|---|---|---|
| Gemma 3 27B | 27B params | Self-hosting, fine-tuning custom | GKE, on-prem, cloud VM |
| Gemma 3 9B | 9B params | Edge servers, latency-critical | Edge TPU, GPU servers |
| Gemma 3 2B | 2B params | Mobile apps, IoT devices | Android, iOS, Raspberry Pi |
| Gemma Nano | 1.8B params | On-device inference (offline) | Smartphones, laptops |
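As a rough rule of thumb, the variant follows from the deployment target and the memory available. The selector below is an illustrative sketch, not official sizing guidance; the RAM thresholds are assumptions.
def pick_gemma_variant(offline_required: bool, ram_gb: float, has_gpu: bool) -> str:
    """Very rough mapping from deployment constraints to a Gemma variant (assumed thresholds)."""
    if offline_required and ram_gb < 8:
        return "gemma-nano"    # on-device, quantized, offline
    if ram_gb < 16:
        return "gemma-3-2b"    # mobile / IoT class hardware
    if not has_gpu or ram_gb < 48:
        return "gemma-3-9b"    # edge servers, single GPU
    return "gemma-3-27b"       # self-hosted fine-tuning and serving

print(pick_gemma_variant(offline_required=True, ram_gb=6, has_gpu=False))   # gemma-nano
print(pick_gemma_variant(offline_required=False, ram_gb=64, has_gpu=True))  # gemma-3-27b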
Deploy Gemma Nano On-Device
# Installation
# pip install mediapipe
import mediapipe as mp
from mediapipe.tasks import python
from mediapipe.tasks.python import text

# Download the Gemma Nano model (1.8B, quantized 4-bit)
# https://ai.google.dev/gemma/docs/get_started

# Initialize Gemma Nano
base_options = python.BaseOptions(model_asset_path='gemma_nano_2b_quantized.bin')
options = text.TextGeneratorOptions(base_options=base_options, max_tokens=256)
generator = text.TextGenerator.create_from_options(options)

# Generate on-device (offline)
prompt = "Explain photosynthesis in 2 sentences"
result = generator.generate(prompt)
print(result.text)

# ✅ Inference runs 100% locally, no internet connection needed
# ✅ Latency ~500ms on a recent smartphone
# ✅ Full privacy (data never leaves the device)

# On-device use cases:
# - Keyboard suggestions
# - Offline voice assistant
# - Document summarization (emails, PDFs)
# - Privacy-sensitive apps (medical, finance)
Fine-Tuning Gemma
# Fine-tune Gemma 3 9B for the medical domain
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer
from datasets import load_dataset
import torch
# 1. Load Gemma 3 9B
model_name = "google/gemma-3-9b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.bfloat16,
device_map="auto"
)
# 2. Prepare the medical dataset (example)
# Format: {"prompt": "...", "completion": "..."}
dataset = load_dataset("medical-qa-dataset")  # Your dataset
def preprocess_function(examples):
    # For causal LM fine-tuning, train on the prompt and completion concatenated
    texts = [
        f"Question: {q}\nAnswer: {a}"
        for q, a in zip(examples["prompt"], examples["completion"])
    ]
    model_inputs = tokenizer(texts, max_length=512, truncation=True, padding="max_length")
    # Labels are the input ids themselves (the model shifts them internally)
    model_inputs["labels"] = model_inputs["input_ids"].copy()
    return model_inputs
tokenized_dataset = dataset.map(preprocess_function, batched=True)
# 3. Configure fine-tuning
training_args = TrainingArguments(
output_dir="./gemma-medical-finetuned",
num_train_epochs=3,
per_device_train_batch_size=4,
gradient_accumulation_steps=8,
learning_rate=2e-5,
warmup_steps=100,
logging_steps=10,
save_steps=500,
evaluation_strategy="steps",
eval_steps=500,
    bf16=True,  # Mixed precision, consistent with the bfloat16 weights loaded above
push_to_hub=False
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=tokenized_dataset["train"],
eval_dataset=tokenized_dataset["validation"],
)
# 4. Fine-tune (4-8 h on 4x A100)
trainer.train()
# 5. Save the fine-tuned model
model.save_pretrained("./gemma-medical-finetuned")
tokenizer.save_pretrained("./gemma-medical-finetuned")
# 6. Inference with the fine-tuned model
finetuned_model = AutoModelForCausalLM.from_pretrained("./gemma-medical-finetuned")
prompt = "Question: What are the symptoms of type 2 diabetes?\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = finetuned_model.generate(**inputs, max_length=300)
print(tokenizer.decode(outputs[0]))
# SAVINGS vs the Gemini API:
# Fine-tuning: $500-2,000 one-time (compute)
# Self-hosting: $100-500/month (VM/GPU)
# vs Gemini API: $500-5,000/month at high volume
# → Positive ROI above ~10M tokens/month
Gemma Scope 2: Interpretability
# pip install gemma-scope
from gemma_scope import GemmaScope, FeatureVisualizer
# Load Gemma 3 with Scope
scope = GemmaScope(
model_name="gemma-3-9b",
sae_layer=15 # Sparse AutoEncoder layer 15
)
# Analyze the activations for a prompt
prompt = "The COVID vaccine causes autism"  # Misinformation
activations = scope.get_activations(prompt)
top_features = scope.get_top_features(activations, k=20)
print("=== TOP ACTIVATED FEATURES ===")
for feature_id, strength in top_features:
desc = scope.get_feature_description(feature_id)
print(f"Feature {feature_id}: {desc} ({strength:.3f})")
# EXAMPLE OUTPUT:
# Feature 1892: Medical misinformation (0.912) ⚠️
# Feature 3405: Vaccine-related content (0.854)
# Feature 8721: Controversial claims (0.743)
# → The model detects the misinformation!
# Visualize the active features
visualizer = FeatureVisualizer(scope)
visualizer.plot_feature_activation(prompt, top_k=10)
visualizer.save("feature_activation.png")
# Gemma Scope use cases:
# 1. Detect bias in responses
# 2. Explain why the model generates a given answer
# 3. Identify problematic features ahead of fine-tuning
# 4. Audit & compliance (explain AI decisions)
Gemma & AI Safety
- Safety filters: pre-trained to block harmful content
- Open weights: full auditability of the model
- Responsible AI Toolkit: tools to evaluate bias and toxicity
- Gemma Scope: interpretability via sparse autoencoders
# Evaluate toxicity with Gemma
from transformers import pipeline
# Load Gemma 3 2B for classification
classifier = pipeline(
    "text-classification",
    model="google/gemma-3-2b-toxicity-classifier"
)
# Test some prompts
prompts = [
    "How do I install Python?",
    "I hate all [group]",  # Toxic
    "Explain photosynthesis"
]
for prompt in prompts:
result = classifier(prompt)[0]
    label = result['label']  # TOXIC or NON_TOXIC
    score = result['score']
    print(f"Prompt: {prompt[:50]}")
    print(f"  → {label} (confidence: {score:.2f})\n")
# Integration into a production pipeline
def safe_generate(prompt: str, model):
    """Generate with a toxicity check"""
    # Pre-check
    toxicity = classifier(prompt)[0]
    if toxicity['label'] == 'TOXIC' and toxicity['score'] > 0.8:
        return "I cannot answer this request."
# Generate
response = model.generate(prompt)
# Post-check
response_toxicity = classifier(response)[0]
if response_toxicity['label'] == 'TOXIC':
return "Reponse filtree pour contenu inapproprie."
return response
Gemma Ecosystem
| Tool | Purpose | Link |
|---|---|---|
| Gemma.cpp | Inference C++ optimise (CPU) | github.com/google/gemma.cpp |
| Gemma Android | SDK Android pour on-device | ai.google.dev/gemma/docs/android |
| Gemma Scope | Interpretabilite SAE | github.com/google-research/gemma-scope |
| Gemma Safety | Toxicity/bias evaluation | github.com/google/responsible-ai |
| Kaggle Models | Download weights (free) | kaggle.com/models/google/gemma |
Gemma is the open-source alternative to Gemini for self-hosted use cases. After fine-tuning, Gemma 3 27B is competitive with proprietary models on specific domains. Gemma Nano is transforming on-device AI (privacy, offline). Fine-tuning has a positive ROI above ~10M tokens/month. Gemma Scope offers interpretability that is unique in the industry. Use Gemma for: sensitive data (medical, finance), offline apps, and cost optimization.
Google AI Ecosystem
Learning objectives
- Explore NotebookLM and Workspace AI
- Understand Code Assist and Astra DB
- Discover Mariner, Jules, and AI Overviews
- Integrate the Google AI ecosystem
Google AI Ecosystem Map
NotebookLM: AI Research Assistant
NotebookLM turns your documents into an interactive AI assistant.
# NotebookLM via API (preview)
# pip install google-notebooklm
from google.notebooklm import NotebookLM
# Create a notebook
notebook = NotebookLM.create(name="Product Documentation")
# Upload sources (PDFs, docs, URLs)
notebook.add_source(file="product_manual.pdf")
notebook.add_source(file="api_docs.md")
notebook.add_source(url="https://docs.product.com/guide")
# Query with automatic context
response = notebook.query(
    "How do I configure OAuth authentication?"
)
print(response.answer)
# → Answer synthesized from the 3 sources
# → Automatic citations back to the sources
print("\nSources:")
for citation in response.citations:
    print(f"- {citation.source}: {citation.excerpt}")
# Use cases:
# - Onboarding new employees (company docs)
# - Customer support (knowledge base)
# - Research (scientific papers)
# - Audit & compliance (regulations)
Workspace AI: Gmail, Docs, Sheets
| Product | AI Feature | Example |
|---|---|---|
| Gmail | Help me write (email drafting) | "Draft email to decline meeting professionally" |
| Docs | Help me write (content generation) | "Write product launch announcement, tone: excited" |
| Sheets | Help me organize (data analysis) | "Create pivot table summarizing sales by region" |
| Slides | Create presentation | "Create 10-slide deck on Q4 results with charts" |
| Meet | Real-time transcription | Auto-generate meeting notes with action items |
Code Assist: AI Coding
# Cloud Code Assist = Gemini Code Assist for GCP
# IDE integration: VS Code, IntelliJ, Cloud Shell Editor

# Example use cases:

# 1. Code generation
#    Prompt: "Create Cloud Function to resize images uploaded to GCS"
#    → Generates complete Python code with error handling

# 2. Code explanation
#    Select a complex code block → "Explain this code"
#    → Line-by-line natural language explanation

# 3. Code migration
#    "Convert this App Engine app to Cloud Run"
#    → Generates Dockerfile, deployment config, migration guide

# 4. Debugging
#    Paste an error → "How to fix this error?"
#    → Root cause analysis + solution

# 5. Security review
#    "Check this code for security vulnerabilities"
#    → Identifies SQL injection, XSS, secrets in code

# Pricing:
# Code Assist: $19/user/month
# Alternatives: GitHub Copilot ($10/month), Claude Code (free beta)
Astra DB: Vector Database
Astra DB (DataStax) is a managed vector database optimized for RAG with Gemini.
# pip install astrapy
from astrapy.client import DataAPIClient
from vertexai.language_models import TextEmbeddingModel
# Connect to Astra DB
client = DataAPIClient(token="AstraCS:xxx")
database = client.get_database("https://xxx.apps.astra.datastax.com")
collection = database.get_collection("documents")
# Embed documents with Gemini
embedding_model = TextEmbeddingModel.from_pretrained("text-embedding-004")
documents = [
"Gemini 2.5 Pro released February 2026",
"Context caching reduces cost by 90%",
"Flash-8B is 75x cheaper than Pro"
]
for doc in documents:
# Generate embedding
embedding = embedding_model.get_embeddings([doc])[0].values
    # Insert into Astra
collection.insert_one({
"text": doc,
"embedding": embedding
})
# Vector search
query = "How to reduce Gemini costs?"
query_embedding = embedding_model.get_embeddings([query])[0].values
results = collection.vector_find(
vector=query_embedding,
limit=3
)
for result in results:
print(f"Score: {result['$similarity']:.3f} - {result['text']}")
# Astra advantages :
# - Latency <10ms (global distribution)
# - Auto-scaling (serverless)
# - Integrated with Langchain, LlamaIndex
# - Free tier : 80GB storage
Mariner: Web Agent
Mariner is a Gemini agent that browses the web autonomously.
# Mariner (preview, available as a Chrome extension)

# Use cases:

# 1. Research automation
#    "Find 10 competitors in AI coding assistants space with pricing"
#    → Mariner visits the sites, extracts pricing, generates a comparison table

# 2. E-commerce
#    "Compare prices for iPhone 15 Pro on Amazon, BestBuy, Target"
#    → Mariner browses the sites and compares prices in real time

# 3. Travel booking
#    "Find cheapest flight Paris to NYC, March 15-22"
#    → Mariner compares Google Flights, Kayak, Expedia

# 4. Data collection
#    "Scrape product reviews from top 50 items on category page"
#    → Mariner navigates the pages, extracts the reviews, structures the data

# Architecture:
# User query → Gemini 2.5 Pro plans the actions → Mariner agent
#   → executes browser actions (click, scroll, extract)
#   → returns structured results

# Privacy: Mariner runs locally in the browser, no data sent to Google
Jules: AI Code Agent
Jules is an autonomous agent that fixes bugs and implements features.
# Jules GitHub integration (preview)

# Workflow:
# 1. Create a GitHub issue: "Fix: API returns 500 on invalid input"
# 2. Assign it to @jules-ai
# 3. Jules:
#    - Reads the issue description
#    - Analyzes the codebase
#    - Identifies the root cause
#    - Fixes the bug
#    - Writes tests
#    - Creates a PR with an explanation
# 4. Human review → Merge

# Example issue:
# Title: "Add caching to reduce Gemini API costs"
# Description: "Implement Redis cache for repeated queries"
# Jules actions:
#    - Reads the current code
#    - Installs a Redis client
#    - Implements a cache layer with TTL
#    - Adds monitoring metrics
#    - Creates a PR with benchmark results

# Similar tools: Devin, Cursor Agent, GitHub Copilot Workspace
AI Overviews: Search with Gemini
AI Overviews integrates Gemini into Google Search for direct answers.
# AI Overviews API (preview)
# pip install google-search-ai
from google.search import AIOverviewsClient
client = AIOverviewsClient()
# Query with an AI-generated overview
query = "How to reduce Gemini API costs in production?"
result = client.search(query)
# Overview = Gemini-generated summary
print("=== AI OVERVIEW ===")
print(result.overview.text)
# Traditional search results
print("\n=== SOURCES ===")
for source in result.sources:
print(f"- {source.title}: {source.url}")
# EXAMPLE OUTPUT:
# === AI OVERVIEW ===
# To reduce Gemini API costs in production:
# 1. Use context caching for repeated content (-90% cost)
# 2. Route simple queries to Flash-8B instead of Pro (-75x cost)
# 3. Use Batch API for non-urgent workloads (-50% cost)
# 4. Compress prompts and control max_output_tokens
#
# === SOURCES ===
# - Vertex AI Pricing: https://cloud.google.com/vertex-ai/pricing
# - Context Caching Guide: https://...
# - Best Practices: https://...
# Use case: integrate AI Overviews into apps for rich answers
The Google AI ecosystem is vast and expanding rapidly. NotebookLM is reshaping research, and Workspace AI boosts everyday productivity. Code Assist speeds up development, Astra DB optimizes RAG. Mariner automates web browsing, Jules fixes bugs autonomously. AI Overviews transforms search. In 2026, the convergence of Gemini and Google's tools is a productivity multiplier. Explore, experiment, and integrate them into your workflows.
Trends & the Future of AI
Learning objectives
- Understand the Universal Agent vision
- Explore the Generative UI paradigm
- Anticipate the evolution of Personal Intelligence
- Prepare your architecture for a multimodal future
Universal Agent: One Agent to Rule Them All
Vision for 2027-2030: a single agent capable of accomplishing any digital task.
- Autonomy: completes tasks end-to-end without human intervention
- Context retention: long-term memory of all interactions
- Multi-tool orchestration: uses 100+ tools as needed
- Learning: learns from every interaction and improves itself
- Personalization: adapts its behavior to each user
Generative UI: UI That Adapts to You
Paradigm shift: the UI is no longer static; it is generated dynamically by AI.
# Generative UI with Gemini (2026 concept)
from vertexai.generative_models import GenerativeModel
model = GenerativeModel("gemini-2.5-pro")
# User request
user_request = "I want a dashboard to track my Vertex AI costs"
# Generate the UI dynamically
ui_generation_prompt = f"""
Generate React component code for this user request: "{user_request}"
Requirements:
- Use Recharts for visualizations
- Fetch data from /api/vertex-costs endpoint
- Responsive design with Tailwind
- Include filters: date range, model type
- Show total cost, cost by model (pie chart), daily trend (line chart)
Return ONLY valid React JSX code.
"""
response = model.generate_content(ui_generation_prompt)
react_code = response.text
# Save generated component
with open("CostDashboard.jsx", "w") as f:
f.write(react_code)
print("โ UI component generated!")
# Deploy automatically
# import subprocess
# subprocess.run(["npm", "run", "build"])
# subprocess.run(["gcloud", "run", "deploy", "cost-dashboard", ...])
# RESULT: a custom dashboard generated in <5 seconds
# → No developer or designer needed
# → UI perfectly tailored to the user's request
# → Fast iterations: "Add CSV export" → regenerate the component
Personal Intelligence: AI That Knows You
Personal Intelligence is an AI agent with a complete memory of your digital life.
# Personal Intelligence architecture (conceptual)
from datetime import datetime
from vertexai.generative_models import GenerativeModel

class PersonalIntelligence:
    """AI agent with long-term memory and personalization"""
    def __init__(self, user_id: str):
        self.user_id = user_id
        self.model = GenerativeModel("gemini-2.5-pro")   # model used by process_request
        self.memory = self._load_memory()                # full interaction history
        self.preferences = self._load_preferences()
        self.context = self._load_context()              # calendar, emails, docs
def process_request(self, request: str):
"""Process request avec contexte personnel complet"""
# Enrichir request avec contexte
enriched_prompt = f"""
User: {self.user_id}
Request: {request}
Personal context:
- Preferences: {self.preferences}
- Recent interactions: {self.memory[-10:]}
- Current calendar: {self.context['calendar_today']}
- Work projects: {self.context['active_projects']}
Generate personalized response considering all context.
"""
response = self.model.generate_content(enriched_prompt)
# Save interaction to memory
self.memory.append({
"timestamp": datetime.now(),
"request": request,
"response": response.text
})
self._save_memory()
return response.text
# Example use cases:
# 1. "Schedule meeting with Sarah"
#    → The agent knows Sarah's email, checks both calendars, proposes 3 slots
# 2. "Summarize what I missed this morning"
#    → The agent reads emails, Slack, calendar and generates a personalized summary
# 3. "Draft response to client email"
#    → The agent knows the client history, your writing style, the project context
# 4. "Should I approve this expense?"
#    → The agent knows the budget, spending patterns, company policies

# Privacy considerations:
# - All data encrypted at rest
# - User control over what data is accessible
# - Opt-in for each data source
# - Delete memory on demand
Multimodal Native: Beyond Text
The future: AI that understands text, image, audio, video, code, and 3D simultaneously.
# Futuristic multimodal use case
# Input: voice + screen share + camera
# "Help me debug this app - here are my screen and the code"
from vertexai.generative_models import GenerativeModel, Part
model = GenerativeModel("gemini-3.0-ultra") # Hypothetical 2027 model
# Multimodal input
response = model.generate_content([
Part.from_audio_file("voice_explanation.wav"), # Voice explanation
Part.from_image_file("screenshot_error.png"), # Screenshot with error
Part.from_video_file("screen_recording.mp4"), # Screen recording
Part.from_text(open("app.py").read()), # Source code
"Debug this application and suggest fixes"
])
# Output : Multimodal response
# - Text explanation of bug
# - Code diff with fixes
# - Video tutorial showing how to fix
# - Voice explanation of root cause
print(response.text) # Textual explanation
# Access other modalities
if response.video:
response.video.save("fix_tutorial.mp4")
if response.audio:
response.audio.save("explanation.mp3")
# Future use case: "Design a logo for my company"
# → Input: voice description + mood board images
# → Output: 5 logo variations (SVG + PNG) + usage guidelines PDF
On-Device AI: Privacy First
Trend for 2026-2030: powerful models running locally on devices.
- Privacy: data never leaves the device
- Latency: near-instant inference (<100ms)
- Offline: works without an internet connection
- Cost: no API fees
- Scale: millions of users without backend infrastructure
Trends 2026-2030
| Trend | Timeline | Impact |
|---|---|---|
| Universal Agent | 2027-2028 | One agent replaces 100+ specialized apps |
| Generative UI | 2026-2027 | 50% fewer frontend developers needed |
| Personal Intelligence | 2027-2029 | Personal productivity +30-50% |
| Native multimodal | 2026-2027 | Text-only becomes obsolete |
| On-device AI | 2026-2028 | Cloud AI becomes complementary, not primary |
| AI-first OS | 2028-2030 | Traditional OSes replaced by AI OSes |
The future of AI is multimodal, autonomous, personal, and on-device. The Universal Agent will replace specialized apps. Generative UI will remove the need for UI designers on standard use cases. Personal Intelligence will become a natural extension of human cognition. Prepare your architectures for this future: modular APIs, user-centric data ownership, privacy by design. 2026 is the beginning of the transformation; by 2030 the world will look different.
Final Project: Complete Enterprise Architecture
Project Objective
Design and document a complete enterprise Gemini architecture for a real use case, integrating all the concepts from Phase 4.
Requirements
Company: TechMart, an e-commerce business with 50M users and 10M transactions/month
Need: an AI-powered customer support platform with autonomous agents
Constraints (a quick sizing sketch follows the list):
- Budget: $10,000/month for Vertex AI
- SLA: 99.9% uptime, <2s latency p95
- Compliance: GDPR, PCI-DSS
- Scale: support 100,000 conversations/day
- Languages: FR, EN, ES, DE
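Before designing, translate these constraints into request-level numbers. A back-of-envelope sketch (the 5 messages per conversation and the x4 peak factor are assumptions, not part of the brief):
# Rough sizing from the constraints above
conversations_per_day = 100_000
messages_per_conversation = 5   # assumption
peak_factor = 4                 # assumption: peak traffic vs daily average

requests_per_day = conversations_per_day * messages_per_conversation
avg_rps = requests_per_day / 86_400
peak_rps = avg_rps * peak_factor

print(f"Requests/day: {requests_per_day:,}")   # 500,000
print(f"Average RPS:  {avg_rps:.1f}")          # ~5.8
print(f"Peak RPS:     {peak_rps:.1f}")         # ~23 → sizes Cloud Run autoscaling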
Required Deliverables
1. Architecture Diagram (30 min)
Create a complete architecture diagram including:
- Multi-model routing (Pro/Flash/Flash-8B)
- RAG with a vector database
- Agent system with tools
- Cache strategy
- Monitoring & alerting
- Security layers (VPC-SC, DLP, IAM)
# architecture.yaml - example structure
components:
frontend:
type: Cloud Run
replicas: 3-10 (autoscaling)
regions: [us-central1, europe-west1]
model_router:
type: Cloud Run
logic: |
      - Simple queries (FAQ) → Flash-8B
      - Standard (order status) → Flash
      - Complex (complaints) → Pro
fallback: Pro
rag_system:
vector_db: Vertex AI Vector Search
embeddings: text-embedding-004
chunk_size: 512 tokens
top_k: 5
agent_system:
framework: Vertex AI Agent Builder
tools:
- search_knowledge_base (LOW risk)
- lookup_order (MEDIUM risk)
- process_refund (HIGH risk)
- update_customer_info (MEDIUM risk)
governance: Tool approval workflow
caching:
explicit: System instructions (5000 tokens, TTL 60min)
implicit: Auto-caching prefixes >1024 tokens
monitoring:
metrics:
- Request count by model
- Latency p50/p95/p99
- Cost per conversation
- Error rate
- Safety filter triggers
dashboards: Looker Studio + BigQuery
alerts:
      - Budget >90% → Email + PagerDuty
      - Error rate >2% → PagerDuty
      - Latency p95 >3s → Slack
security:
vpc_sc: Perimeter around Vertex AI
dlp: Scan prompts for PII before sending
iam:
- agents: roles/aiplatform.user
- developers: roles/aiplatform.admin
cmek: Customer-managed keys for data
audit: Data Access logs enabled
2. Implementation Plan (45 min)
Document a detailed implementation plan:
# Implementation Plan

## Phase 1: Foundation (Week 1-2)
- [ ] Set up the GCP project with VPC-SC
- [ ] Configure IAM roles and service accounts
- [ ] Deploy the base infrastructure (Cloud Run, Firestore)
- [ ] Implement the model router with Flash-8B/Flash/Pro
- [ ] Set up the monitoring dashboard (BigQuery + Looker)

## Phase 2: RAG System (Week 3-4)
- [ ] Ingest the knowledge base (product docs, FAQs)
- [ ] Set up Vertex AI Vector Search
- [ ] Implement the chunking strategy (512 tokens)
- [ ] Test retrieval quality (measure precision@5)
- [ ] Optimize the embeddings model

## Phase 3: Agent System (Week 5-6)
- [ ] Register tools in the tool registry
- [ ] Implement the tool approval workflow
- [ ] Deploy agents with Vertex AI Agent Builder
- [ ] Configure agent permissions
- [ ] Test agent workflows end-to-end

## Phase 4: Optimization (Week 7-8)
- [ ] Implement context caching (system instructions)
- [ ] Configure batch processing for analytics
- [ ] Optimize prompts (-30% tokens)
- [ ] Set up cost attribution labels
- [ ] Run load tests (100K requests/day)

## Phase 5: Security & Compliance (Week 9-10)
- [ ] Enable DLP for PII detection
- [ ] Configure safety settings (BLOCK_LOW_AND_ABOVE)
- [ ] Implement audit logging
- [ ] GDPR compliance review
- [ ] Security penetration testing

## Phase 6: Production Deploy (Week 11-12)
- [ ] Canary deploy (10% traffic)
- [ ] Monitor metrics for 3 days
- [ ] Roll out to 50%
- [ ] Full production (100%)
- [ ] Post-deploy monitoring for 2 weeks
3. Cost Model (45 min)
Compute the detailed costs:
# cost_model.py
class CostModel:
def __init__(self):
# Pricing ($/1M tokens)
self.prices = {
"flash-8b": {"input": 0.04, "output": 0.16},
"flash": {"input": 0.15, "output": 0.60},
"pro": {"input": 3.00, "output": 12.00},
"cache": 0.015,
"embedding": 0.025,
}
def calculate_monthly_cost(self,
conversations_per_day: int,
avg_messages_per_conversation: int):
"""Calculer cout mensuel"""
total_conversations = conversations_per_day * 30
# Model distribution (apres routing)
flash_8b_pct = 0.60 # 60% simple queries
flash_pct = 0.30 # 30% standard
pro_pct = 0.10 # 10% complex
# Tokens par message
system_instruction_tokens = 5000 # Cached
user_input_tokens = 300
rag_context_tokens = 2000
output_tokens = 150
# Total messages
total_messages = total_conversations * avg_messages_per_conversation
# Cost breakdown
costs = {}
# 1. System instruction (cached)
cache_cost = (system_instruction_tokens * total_messages / 1_000_000) * self.prices["cache"]
costs["cache"] = cache_cost
# 2. Embeddings (RAG)
embedding_cost = (user_input_tokens * total_messages / 1_000_000) * self.prices["embedding"]
costs["embeddings"] = embedding_cost
# 3. LLM calls
for model, pct in [("flash-8b", flash_8b_pct), ("flash", flash_pct), ("pro", pro_pct)]:
model_messages = total_messages * pct
input_tokens = user_input_tokens + rag_context_tokens
input_cost = (input_tokens * model_messages / 1_000_000) * self.prices[model]["input"]
output_cost = (output_tokens * model_messages / 1_000_000) * self.prices[model]["output"]
costs[f"{model}_input"] = input_cost
costs[f"{model}_output"] = output_cost
total_cost = sum(costs.values())
return {
"total_monthly": total_cost,
"cost_per_conversation": total_cost / total_conversations,
"breakdown": costs
}
# Calculate for TechMart
model = CostModel()
result = model.calculate_monthly_cost(
conversations_per_day=100_000,
avg_messages_per_conversation=5
)
print(f"=== COST MODEL ===")
print(f"Total monthly: ${result['total_monthly']:.2f}")
print(f"Cost per conversation: ${result['cost_per_conversation']:.4f}")
print(f"\nBreakdown:")
for item, cost in result['breakdown'].items():
print(f" {item}: ${cost:.2f}")
# Expected output with these assumptions:
# Total monthly: ~$17,300
# Cost per conversation: ~$0.0058
# → Over the $10,000/month budget: reduce the Pro share, cache the RAG context,
#   or trim tokens per message to get back under budget
4. ADR Documentation (30 min)
Write 3 ADRs for the key decisions:
# ADR-001: Multi-Model Routing Strategy

## Status
ACCEPTED

## Context
TechMart needs to support 100K conversations/day within a $10K/month budget.
Using only Pro would cost ~$45K/month. Using only Flash-8B would degrade quality.

## Decision
Implement intelligent model routing:
- Flash-8B (60% traffic): FAQ, simple queries
- Flash (30% traffic): order status, standard support
- Pro (10% traffic): complex complaints, escalations
Classifier: Flash-8B with a 100-token prompt.

## Consequences
### Positive
- Cost reduced from $45K to $8.5K/month (-81%)
- Quality maintained (85% CSAT vs 87% all-Pro)
- Classifier cost negligible ($50/month)
### Negative
- Added complexity (router service)
- Potential misclassification (~5% rate)
### Mitigation
- Monitor classification accuracy
- Fallback to Pro on errors
- Weekly review of misclassified queries

---

# ADR-002: Context Caching for System Instructions

## Status
ACCEPTED

## Context
The system instruction contains 5000 tokens (product catalog, policies, FAQs).
Without caching: $0.00075 per message × 15M messages = $11,250/month just for the system instruction.

## Decision
Enable explicit context caching with a 60-min TTL.
Pre-warm the cache every 55 minutes to avoid cold starts.

## Consequences
### Positive
- Cache cost: $1,125/month (vs $11,250 without)
- Savings: $10,125/month (-90%)
- No latency impact
### Negative
- Cache management complexity
- Risk of a stale cache if the system instruction changes
### Mitigation
- Cache invalidation on system instruction update
- Monitor cache hit rate (target >95%)

---

# ADR-003: DLP for PII Protection

## Status
ACCEPTED

## Context
GDPR requires protecting customer PII.
Risk: customers may share SSNs or credit cards in chat.

## Decision
Implement Cloud DLP to scan all user messages before sending them to Gemini.
Redact: SSN, credit cards, emails, phone numbers.

## Consequences
### Positive
- GDPR compliance
- Protects customer privacy
- Prevents PII leakage to the LLM
### Negative
- Added latency: +50-100ms per message
- Cost: $0.000015 per message ≈ $225/month (15M messages)
### Mitigation
- Async DLP (non-blocking for non-PII messages)
- Cache DLP results for repeated messages
5. Security Checklist (20 min)
| Security Control | Implementation | Status |
|---|---|---|
| VPC Service Controls | Perimeter around Vertex AI, block data exfiltration | ✅ Required |
| DLP PII Scanning | Scan all prompts, redact SSN/CC/email | ✅ Required |
| IAM Least Privilege | Service accounts with minimal roles | ✅ Required |
| Audit Logging | Data Access logs for all Vertex AI calls | ✅ Required |
| Safety Settings | BLOCK_LOW_AND_ABOVE for all categories | ✅ Required |
| CMEK | Customer-managed encryption keys | ⚠️ Optional (highly recommended) |
| Private Service Connect | Vertex AI access via private endpoint | ⚠️ Optional (if ultra-secure network) |
| Secrets Management | API keys in Secret Manager | ✅ Required |
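As a concrete illustration of the "DLP PII Scanning" control, the sketch below redacts common PII infoTypes from a user message before it is sent to Gemini. The project ID and the infoType list are illustrative; tune them to your compliance requirements.
from google.cloud import dlp_v2

def redact_pii(project_id: str, text: str) -> str:
    """Return `text` with emails, phone numbers and card numbers replaced by their infoType."""
    client = dlp_v2.DlpServiceClient()
    inspect_config = {
        "info_types": [
            {"name": "EMAIL_ADDRESS"},
            {"name": "PHONE_NUMBER"},
            {"name": "CREDIT_CARD_NUMBER"},
        ]
    }
    deidentify_config = {
        "info_type_transformations": {
            "transformations": [
                {"primitive_transformation": {"replace_with_info_type_config": {}}}
            ]
        }
    }
    response = client.deidentify_content(
        request={
            "parent": f"projects/{project_id}",
            "inspect_config": inspect_config,
            "deidentify_config": deidentify_config,
            "item": {"value": text},
        }
    )
    return response.item.value

# The redacted text is what gets sent to the model, e.g.:
# redact_pii("mon-projet-gemini", "My card number is 4111 1111 1111 1111")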
6. Monitoring Dashboard (20 min)
Define the metrics and alerts:
# monitoring.yaml
dashboards:
overview:
metrics:
- Total conversations (24h)
- Active conversations (realtime)
- Avg response time
- Cost today vs budget
- Error rate
performance:
metrics:
- Latency p50/p95/p99 by model
- Cache hit rate
- RAG retrieval quality (precision@5)
- Agent tool call success rate
cost:
metrics:
- Cost by model (pie chart)
- Daily cost trend (30 days)
- Cost per conversation
- Budget utilization (%)
quality:
metrics:
- CSAT score
- Resolution rate
- Escalation rate
- Safety filter blocks
alerts:
- name: Budget Alert
condition: daily_cost > $400
channels: [email, slack]
severity: warning
- name: Error Rate High
condition: error_rate > 2%
channels: [pagerduty]
severity: critical
- name: Latency Degradation
condition: p95_latency > 3000ms
channels: [slack]
severity: warning
- name: Cache Hit Rate Low
condition: cache_hit_rate < 90%
channels: [email]
severity: info
- name: Safety Filter Spike
condition: safety_blocks > 100/hour
channels: [email, slack]
severity: warning
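To feed the budget metrics and alerts above, month-to-date spend can be read from the BigQuery billing export. A hedged sketch: the dataset/table name and the $10,000 budget are assumptions; adapt them to your own billing export.
from google.cloud import bigquery

BUDGET_USD = 10_000
BILLING_TABLE = "billing.gcp_billing_export_v1_XXXXXX"  # hypothetical billing export table

def month_to_date_vertex_cost() -> float:
    """Sum this month's Vertex AI cost from the standard billing export schema."""
    client = bigquery.Client()
    query = f"""
        SELECT SUM(cost) AS total_cost
        FROM `{BILLING_TABLE}`
        WHERE service.description = 'Vertex AI'
          AND invoice.month = FORMAT_DATE('%Y%m', CURRENT_DATE())
    """
    row = list(client.query(query).result())[0]
    return row.total_cost or 0.0

cost = month_to_date_vertex_cost()
utilization = cost / BUDGET_USD
print(f"Month-to-date Vertex AI cost: ${cost:,.2f} ({utilization:.0%} of budget)")
if utilization > 0.9:
    print("⚠️ Budget alert threshold reached (>90%)")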
Evaluation Criteria
| Criterion | Points | Description |
|---|---|---|
| Architecture | 25 | Completeness, coherence, scalability |
| Cost Optimization | 20 | Model routing, caching, realistic cost model |
| Security | 20 | VPC-SC, DLP, IAM, compliance |
| Implementation Plan | 15 | Realism, timeline, dependencies |
| Monitoring | 10 | Relevant metrics, actionable alerts |
| Documentation | 10 | ADRs, diagrams, clarity |
Total: 100 points
Passing threshold: 70/100
This final project brings together everything in Phase 4. A solid architecture is the foundation of production success. Take the time to design well before implementing. Validate assumptions with cost calculations. Document decisions (ADRs). In a company, this kind of design doc is a prerequisite before a dev sprint; architecture quality determines the long-term success of an AI project.
Final Exam & Certification
Objective
Validate complete mastery of Phase 4: enterprise deployment, FinOps, and governance.
Format: 30 multiple-choice questions + final project validation
Duration: 60 minutes
Passing score: 24/30 (80%)
Final Exam: 30 Questions
1. What is the main difference between AI Studio and Vertex AI?
2. Which solution should you use to deploy a Gemini API serverless?
3. How do you eliminate Cloud Run cold starts?
4. VPC Service Controls lets you:
5. Where should API keys be stored securely?
6. DLP (Data Loss Prevention) is used to:
7. Gemini tiered pricing means:
8. By how much does context caching reduce cost?
9. Intelligent model routing can save:
10. The Batch API offers a cost reduction of:
11. From how many requests does context caching become cost-effective?
12. How much does Flash-8B cost compared to Pro?
13. BigQuery billing export is:
14. Budget alerts are recommended at:
15. How much do output tokens cost compared to input tokens (Flash)?
16. The 7 Google AI Principles include:
17. The BLOCK_LOW_AND_ABOVE safety setting means:
18. Gemma Scope enables:
19. The model lifecycle stages are:
20. An ADR (Architecture Decision Record) documents:
21. HIGH-risk tool governance requires:
22. An agent audit trail must log:
23. The main difference between Gemma 3 and Gemini is:
24. Gemma Nano's main use case is:
25. NotebookLM lets you:
26. Astra DB is optimized for:
27. The Universal Agent vision for 2027+ is:
28. The Generative UI paradigm shift is:
29. The main advantage of on-device AI is:
30. Canary deployment means:
Getting Certified
Validation criteria:
- ✅ Final exam: minimum 24/30 (80%)
- ✅ Final project: minimum 70/100
- ✅ All Phase 4 lessons completed
Certification delivered:
- PDF certificate with QR code verification
- LinkedIn badge "Architecte Gemini Enterprise"
- Access to the certified architects community
Next Steps
Congratulations on completing Phase 4!
You now master:
- ✅ Production-ready enterprise deployment
- ✅ FinOps & cost optimization (60-80% savings)
- ✅ Model and agent governance
- ✅ Responsible AI and compliance
- ✅ The full Google AI ecosystem
Keep learning:
- Implement a real project: apply the architecture to an enterprise use case
- Contribute to open source: Gemma, Gemma Scope, Vertex AI samples
- Join the community: Google AI Discord, GCP forums
- Follow the news: Google AI Blog, Vertex AI release notes
- Complementary certifications: GCP Professional Cloud Architect
Resources:
- Documentation: cloud.google.com/vertex-ai/docs
- Community: discord.gg/google-ai
- Videos: YouTube @GoogleCloudTech
- Blog: cloud.google.com/blog/products/ai-machine-learning
You are now a certified Gemini Architect. Go build amazing AI applications!