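AWS CI/CD in Practice, Part 08: Monitoring and Troubleshooting (CloudWatch Logs, SNS Alerts, Rollback Configuration, and Manual Retries)
Series guide: In the previous part we locked down permissions. Security is only the baseline, though. What makes CI/CD usable in production is observability: when something breaks, you can see it, you get alerted, and you can roll back quickly. This part builds a CI/CD monitoring setup along four dimensions: logs, alerts, rollback, and retries.
Monitoring architecture overview
Pitfall 1: Logs are scattered and the error is hard to find
Symptom
The pipeline fails, and the console shows nothing but a red cross: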
Pipeline execution failed at stage 'Build'
You have to open CodePipeline, then CodeBuild, then CloudWatch Logs, switching back and forth to find the error.
Solution: a unified log structure
CodeBuild log configuration
# buildspec.yml
version: 0.2
env:
  variables:
    LOG_GROUP: "/aws/codebuild/mfmsapp"
    LOG_STREAM_PREFIX: "build"
phases:
  pre_build:
    commands:
      - echo "[START] $(date -u +%Y-%m-%dT%H:%M:%SZ) Build started"
      - echo "[INFO] Commit: $CODEBUILD_RESOLVED_SOURCE_VERSION"
      # CODEBUILD_WEBHOOK_HEAD_REF is only set for webhook-triggered builds
      - echo "[INFO] Branch: $CODEBUILD_WEBHOOK_HEAD_REF"
  build:
    commands:
      - echo "[BUILD] Compiling Go binary..."
      # Assumes the main package is in the repository root; "-o <file>" fails
      # when "./..." matches more than one package
      - go build -v -o mfmsapp .
      - echo "[BUILD] Binary size: $(ls -lh mfmsapp | awk '{print $5}')"
      - echo "[TEST] Running tests..."
      # Capture go test's own exit code; after a pipe into tee, $? would be
      # tee's status and a failing test run would not fail the build
      - |
        TEST_RC=0
        go test -v ./... > test-output.txt 2>&1 || TEST_RC=$?
        cat test-output.txt
        echo "[TEST] Exit code: $TEST_RC"
        [ "$TEST_RC" -eq 0 ]
  post_build:
    commands:
      - echo "[END] $(date -u +%Y-%m-%dT%H:%M:%SZ) Build completed"
      - |
        if [ "$CODEBUILD_BUILD_SUCCEEDING" = "1" ]; then
          echo "[RESULT] BUILD SUCCESS"
        else
          echo "[RESULT] BUILD FAILED"
          echo "[DIAG] Last 20 lines of output:"
          tail -20 test-output.txt
        fi
# buildspec.yml has no "logs" section. Set the log group (/aws/codebuild/mfmsapp)
# and the stream name on the CodeBuild project itself (logsConfig.cloudWatchLogs).
With the logs structured, search them quickly with CloudWatch Logs Insights:
fields @timestamp, @message
| filter @message like /ERROR|FAIL|WARN/
| sort @timestamp desc
| limit 50
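The value of the `[TAG] message` convention is easier to see with a small local sketch. The Python below is illustrative only: `parse_line`, `find_problems`, and the sample lines are hypothetical, and the function merely mimics the `filter @message like /ERROR|FAIL|WARN/` step on a list of strings rather than calling CloudWatch.

```python
import re

# Matches the "[TAG] message" convention used by the echo commands in buildspec.yml.
TAG_RE = re.compile(r"^\[(?P<tag>[A-Z]+)\]\s*(?P<message>.*)$")


def parse_line(line):
    """Return (tag, message) for a tagged line, or (None, line) for untagged output."""
    stripped = line.strip()
    match = TAG_RE.match(stripped)
    if match:
        return match.group("tag"), match.group("message")
    return None, stripped


def find_problems(lines, pattern=r"ERROR|FAIL|WARN"):
    """Keep only lines matching the pattern, like the Insights filter does."""
    rx = re.compile(pattern)
    return [parse_line(line) for line in lines if rx.search(line)]


# Made-up sample output, not taken from a real build.
sample = [
    "[START] 2024-01-01T00:00:00Z Build started",
    "[RESULT] BUILD FAILED",
    "--- FAIL: TestLogin (0.01s)",
]
problems = find_problems(sample)
for tag, message in problems:
    print(tag, "|", message)
```

Tagged lines can be grouped by tag, while untagged tool output (such as `go test` failures) still surfaces through the same filter.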
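Pitfall 2: Nobody notices when the pipeline fails
Symptom
A deployment failed on Friday afternoon and went unnoticed until users complained on Monday morning: a three-day blind spot.
Solution: real-time alerts with EventBridge and SNS
Step 1: Create the SNS topic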
aws sns create-topic --name mfmsapp-cicd-alerts
# Subscribe an email address
aws sns subscribe \
  --topic-arn arn:aws:sns:ap-northeast-1:123456789012:mfmsapp-cicd-alerts \
  --protocol email \
  --notification-endpoint your-team@company.com
# Confirm the subscription (check the inbox)
Step 2: Create the EventBridge rule
Save the following event pattern as event-pattern.json:
{
  "source": ["aws.codepipeline"],
  "detail-type": ["CodePipeline Pipeline Execution State Change"],
  "detail": {
    "pipeline": ["mfmsapp-pipeline"],
    "state": ["FAILED", "SUCCEEDED", "STARTED"]
  }
}
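aws events put-rule \
  --name mfmsapp-pipeline-monitor \
  --event-pattern file://event-pattern.json \
  --state ENABLED
# Attach the SNS topic as the target.
# The topic's access policy must allow events.amazonaws.com to call sns:Publish,
# otherwise EventBridge cannot deliver to it.
aws events put-targets \
  --rule mfmsapp-pipeline-monitor \
  --targets Id=1,Arn=arn:aws:sns:ap-northeast-1:123456789012:mfmsapp-cicd-alerts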
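Before deploying a rule it helps to reason about which events the pattern lets through. The sketch below is a deliberately simplified local matcher, not EventBridge's real algorithm: it only handles the two constructs used in the pattern above (lists of exact values and nested objects), and `matches` plus the sample events are hypothetical. The `aws events test-event-pattern` command remains the authoritative check.

```python
def matches(pattern, event):
    """Simplified matching: every pattern key must exist in the event;
    a list means "any of these exact values"; a dict recurses."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


PATTERN = {
    "source": ["aws.codepipeline"],
    "detail-type": ["CodePipeline Pipeline Execution State Change"],
    "detail": {
        "pipeline": ["mfmsapp-pipeline"],
        "state": ["FAILED", "SUCCEEDED", "STARTED"],
    },
}


def make_event(pipeline, state):
    """Build a minimal, made-up event with only the fields the pattern inspects."""
    return {
        "source": "aws.codepipeline",
        "detail-type": "CodePipeline Pipeline Execution State Change",
        "detail": {"pipeline": pipeline, "state": state},
    }


print(matches(PATTERN, make_event("mfmsapp-pipeline", "FAILED")))    # matched
print(matches(PATTERN, make_event("mfmsapp-pipeline", "CANCELED")))  # state not listed
print(matches(PATTERN, make_event("other-pipeline", "FAILED")))      # other pipeline
```

Note that `CANCELED` is not in the pattern's state list, so canceled executions would not trigger a notification with the rule as written.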
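Step 3: Customize the notification message (optional but recommended)
By default the SNS message is the raw event JSON, which is hard to read. Use a Lambda function to format it. In that case, set the Lambda function rather than the SNS topic as the rule target, so each event produces only the formatted message: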
import boto3

sns = boto3.client('sns')
TOPIC_ARN = 'arn:aws:sns:ap-northeast-1:123456789012:mfmsapp-cicd-alerts'
# The CodeBuild log group differs from the pipeline name, so keep it explicit
LOG_GROUP = '/aws/codebuild/mfmsapp'
STATUS_EMOJI = {
    'STARTED': '🚀',
    'SUCCEEDED': '✅',
    'FAILED': '❌',
    'CANCELED': '⚠️'
}


def lambda_handler(event, context):
    detail = event['detail']
    pipeline = detail['pipeline']
    state = detail['state']
    execution_id = detail['execution-id']
    emoji = STATUS_EMOJI.get(state, '❓')
    lines = [
        f"{emoji} CI/CD pipeline state change",
        f"Pipeline: {pipeline}",
        f"State: {state}",
        f"Execution ID: {execution_id}",
        f"Time: {event['time']}",
        f"Details: https://ap-northeast-1.console.aws.amazon.com/codesuite/codepipeline/pipelines/{pipeline}/view",
    ]
    if state == 'FAILED':
        lines += [
            "",
            "⚡ Quick troubleshooting:",
            "1. Open the link above to see the failed stage",
            f"2. Check CloudWatch Logs: {LOG_GROUP}",
            "3. Check for recent code or configuration changes",
        ]
    sns.publish(
        TopicArn=TOPIC_ARN,
        # SNS subjects must be ASCII text, so the emoji goes in the body only
        Subject=f'[{pipeline}] {state}',
        Message='\n'.join(lines)
    )
    return {'statusCode': 200}
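One detail worth guarding against: the SNS `Publish` API restricts `Subject` to ASCII text of fewer than 100 characters with no line breaks, so a subject containing an emoji is rejected. A minimal sketch of a sanitizer follows; `safe_subject` and its fallback text are hypothetical names chosen for illustration.

```python
def safe_subject(text, max_len=99, fallback="CI/CD notification"):
    """Make a string acceptable as an SNS email subject.

    Drops non-ASCII characters (such as emoji), collapses whitespace and
    line breaks into single spaces, and truncates to max_len characters.
    """
    ascii_only = text.encode("ascii", "ignore").decode("ascii")
    cleaned = " ".join(ascii_only.split())
    return cleaned[:max_len] or fallback


print(safe_subject("❌ [mfmsapp-pipeline] FAILED"))
```

The emoji can still be used in the message body, which has no ASCII restriction.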
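Pitfall 3: A failed deployment cannot be rolled back quickly
Symptom
After v3 was deployed to EC2 the application crashed, and restoring it by hand over SSH took 10 minutes. Until the rollback finished, users saw 500 errors.
Solution: CodeDeploy automatic rollback plus health checks
Configure automatic rollback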
{
  "applicationName": "mfmsapp",
  "deploymentGroupName": "mfmsapp-prod",
  "deploymentConfigName": "CodeDeployDefault.OneAtATime",
  "autoRollbackConfiguration": {
    "enabled": true,
    "events": [
      "DEPLOYMENT_FAILURE",
      "DEPLOYMENT_STOP_ON_REQUEST",
      "DEPLOYMENT_STOP_ON_ALARM"
    ]
  },
  "alarmConfiguration": {
    "alarms": [
      {
        "name": "mfmsapp-http-5xx-high"
      }
    ],
    "enabled": true
  }
}
Configure the health check script
Add the health check to the ValidateService lifecycle hook in appspec.yml:
# appspec.yml
version: 0.0
os: linux
files:
  - source: /mfmsapp
    destination: /opt/mfmsapp
hooks:
  BeforeInstall:
    - location: scripts/stop.sh
      timeout: 30
  AfterInstall:
    - location: scripts/start.sh
      timeout: 60
  ValidateService:
    - location: scripts/health_check.sh
      timeout: 120
      runas: root
#!/bin/bash
# scripts/health_check.sh
APP_URL="http://localhost:8080/health"
MAX_RETRIES=10
RETRY_INTERVAL=5

echo "[HEALTH] Starting health check for $APP_URL"
for i in $(seq 1 "$MAX_RETRIES"); do
  # --max-time keeps a hung request from consuming the hook's 120-second timeout
  HTTP_CODE=$(curl -s -o /dev/null --max-time 5 -w "%{http_code}" "$APP_URL" 2>/dev/null)
  if [ "$HTTP_CODE" = "200" ]; then
    echo "[HEALTH] ✅ Health check passed on attempt $i (HTTP $HTTP_CODE)"
    exit 0
  fi
  echo "[HEALTH] Attempt $i/$MAX_RETRIES: HTTP $HTTP_CODE (waiting ${RETRY_INTERVAL}s...)"
  sleep "$RETRY_INTERVAL"
done
echo "[HEALTH] ❌ Health check failed after $MAX_RETRIES attempts"
exit 1
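The retry settings have to fit inside the hook's `timeout: 120`, otherwise CodeDeploy stops the script before it can report its own failure. A quick arithmetic sketch, assuming each `curl` call is capped at a fixed number of seconds (for example with `--max-time`, which is an assumption here) and that the script sleeps after every failed attempt; `worst_case_seconds` and `fits_in_timeout` are hypothetical helper names.

```python
def worst_case_seconds(max_retries, retry_interval, curl_timeout):
    """Upper bound on runtime: every attempt waits for curl, then sleeps."""
    return max_retries * (curl_timeout + retry_interval)


def fits_in_timeout(max_retries, retry_interval, curl_timeout, hook_timeout):
    """True if the worst case still finishes before the hook timeout."""
    return worst_case_seconds(max_retries, retry_interval, curl_timeout) < hook_timeout


# MAX_RETRIES=10 and RETRY_INTERVAL=5 from the script, hook timeout 120 s.
print(worst_case_seconds(10, 5, 5), fits_in_timeout(10, 5, 5, 120))
print(worst_case_seconds(10, 5, 10), fits_in_timeout(10, 5, 10, 120))
```

With a 5-second cap per request the worst case is 100 seconds, which fits. With a 10-second cap it grows to 150 seconds and the hook would time out first.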
Create the CloudWatch alarm
aws cloudwatch put-metric-alarm \
  --alarm-name mfmsapp-http-5xx-high \
  --namespace "AWS/ApplicationELB" \
  --metric-name HTTPCode_Target_5XX_Count \
  --dimensions Name=LoadBalancer,Value=app/mfmsapp-alb/xxxxx \
  --statistic Sum \
  --period 60 \
  --evaluation-periods 2 \
  --threshold 10 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:ap-northeast-1:123456789012:mfmsapp-cicd-alerts \
  --treat-missing-data notBreaching
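To make the alarm parameters concrete, here is a simplified model of how these settings combine: the alarm fires only when the per-minute `Sum` is strictly greater than 10 in each of the last 2 periods, and a missing datapoint counts as not breaching. This is a sketch, not CloudWatch's full evaluation logic, and `alarm_state` is a hypothetical function.

```python
def alarm_state(datapoints, threshold=10, evaluation_periods=2):
    """datapoints: per-period Sum values, oldest first; None means missing data.

    Missing or absent periods are treated as not breaching, mirroring
    --treat-missing-data notBreaching in a simplified way.
    """
    padded = [None] * evaluation_periods + list(datapoints)
    recent = padded[-evaluation_periods:]
    breaching = [value is not None and value > threshold for value in recent]
    return "ALARM" if all(breaching) else "OK"


print(alarm_state([3, 12, 15]))   # two consecutive breaching minutes
print(alarm_state([12, 5]))       # recovered in the latest minute
print(alarm_state([12, None]))    # no traffic in the latest minute
print(alarm_state([10, 10]))      # equal to the threshold is not "greater than"
```

The practical consequence is that a single one-minute spike does not roll back a deployment, and an idle load balancer (no datapoints) never trips the alarm.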
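Pitfall 4: A failed build cannot be retried
Symptom
A CodeBuild build fails for a transient reason (Docker Hub rate limiting, an npm registry timeout), but CodePipeline does not retry automatically, so you have to click "Release change" and rerun the entire pipeline.
Solution
Option A: Retry logic inside buildspec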
phases:
  build:
    commands:
      # The function is defined and used in the same command block, so the
      # example does not rely on shell functions surviving between commands.
      - |
        retry() {
          n=1
          max=3
          # "$@" preserves the original arguments, including quoted ones
          until "$@"; do
            if [ "$n" -ge "$max" ]; then
              echo "[RETRY] ❌ All $max attempts failed for: $*"
              return 1
            fi
            echo "[RETRY] Attempt $n/$max failed. Retrying in 5s..."
            n=$((n+1))
            sleep 5
          done
        }
        retry go mod download && retry docker pull golang:1.22
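When build steps are driven from a script rather than inline shell, the same fixed-interval retry can be expressed in Python. This is a minimal sketch under that assumption; `retry` and its parameters are hypothetical, and the `sleep` argument is injectable only so the helper can be exercised without real waiting.

```python
import time


def retry(func, max_attempts=3, delay=5, sleep=time.sleep):
    """Call func() until it succeeds or max_attempts is reached.

    Waits `delay` seconds between attempts and raises RuntimeError,
    chained to the last error, when every attempt has failed.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception as exc:  # broad on purpose: any failure triggers a retry
            last_error = exc
            if attempt < max_attempts:
                print(f"[RETRY] Attempt {attempt}/{max_attempts} failed. Retrying in {delay}s...")
                sleep(delay)
    raise RuntimeError(f"all {max_attempts} attempts failed") from last_error
```

A call such as `retry(lambda: subprocess.run(["go", "mod", "download"], check=True))` would wrap a flaky command. Retrying only helps with transient failures; a genuine compile error will simply fail three times.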
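Option B: CodePipeline automatic stage retry
{
  "name": "Build",
  "onFailure": {
    "result": "RETRY",
    "retryConfiguration": {
      "retryMode": "FAILED_ACTIONS"
    }
  }
}
Merge the onFailure block into the Build stage of the pipeline definition. Stage-level automatic retry requires a V2 pipeline, and CodePipeline makes a single automatic retry attempt; there is no retry-count setting: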
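aws codepipeline get-pipeline --name mfmsapp-pipeline > pipeline.json
# Edit pipeline.json: add retryConfiguration to the Build stage in the stages array,
# and delete the top-level "metadata" block, which update-pipeline does not accept.
# Then apply the change:
aws codepipeline update-pipeline --cli-input-json file://pipeline.json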
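Option C: Automatic retry with Lambda
import boto3

codepipeline = boto3.client('codepipeline')
MAX_RUNS_PER_ACTION = 2  # the first run plus one retry


def lambda_handler(event, context):
    # Triggered by an EventBridge rule that captures FAILED pipeline executions
    detail = event['detail']
    pipeline_name = detail['pipeline']
    execution_id = detail['execution-id']

    # Prevent an infinite loop. A stage retry keeps the same execution ID,
    # so count how often each action has already run in this execution.
    history = codepipeline.list_action_executions(
        pipelineName=pipeline_name,
        filter={'pipelineExecutionId': execution_id}
    )
    runs = {}
    for action in history['actionExecutionDetails']:
        key = (action['stageName'], action['actionName'])
        runs[key] = runs.get(key, 0) + 1
    if any(count >= MAX_RUNS_PER_ACTION for count in runs.values()):
        print(f"Already retried, skipping. Execution: {execution_id}")
        return

    # Stage states come from get_pipeline_state, not get_pipeline_execution
    state = codepipeline.get_pipeline_state(name=pipeline_name)
    failed_stage = next(
        (s for s in state['stageStates']
         if s.get('latestExecution', {}).get('status') == 'Failed'
         and s['latestExecution'].get('pipelineExecutionId') == execution_id),
        None
    )
    if failed_stage:
        stage_name = failed_stage['stageName']
        print(f"Retrying failed stage: {stage_name}")
        # Retry only the failed actions in that stage
        codepipeline.retry_stage_execution(
            pipelineName=pipeline_name,
            stageName=stage_name,
            pipelineExecutionId=execution_id,
            retryMode='FAILED_ACTIONS'
        )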
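The stage-selection step can be tried out locally against a dictionary shaped like a `get_pipeline_state` response, where each entry in `stageStates` carries a `latestExecution` with `pipelineExecutionId` and `status`. The sketch below assumes that shape; `find_failed_stage`, the stage names, and the execution IDs are made up for illustration.

```python
def find_failed_stage(pipeline_state, execution_id):
    """Return the name of the stage that failed in the given execution, or None."""
    for stage in pipeline_state.get("stageStates", []):
        latest = stage.get("latestExecution", {})
        if (latest.get("status") == "Failed"
                and latest.get("pipelineExecutionId") == execution_id):
            return stage["stageName"]
    return None


# Hypothetical, trimmed-down response for local experimentation.
sample_state = {
    "pipelineName": "mfmsapp-pipeline",
    "stageStates": [
        {"stageName": "Source",
         "latestExecution": {"pipelineExecutionId": "exec-123", "status": "Succeeded"}},
        {"stageName": "Build",
         "latestExecution": {"pipelineExecutionId": "exec-123", "status": "Failed"}},
        {"stageName": "Deploy"},  # never reached, so no latestExecution
    ],
}

print(find_failed_stage(sample_state, "exec-123"))
```

Matching on the execution ID matters: without it, a stale failure from an older execution could be retried by mistake.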
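Pitfall 5: Monitoring metrics are incomplete
Symptom
You only monitor whether the pipeline succeeded, and overlook:
Whether build time keeps growing (dependency bloat, cache misses)
Whether deployment frequency is dropping (the team is afraid to deploy)
Whether the failure rate is rising (declining code quality)
Solution: DORA metrics monitoring
Track CI/CD health with CloudWatch Metrics plus custom metrics: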
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client('cloudwatch')
codepipeline = boto3.client('codepipeline')


def publish_cicd_metrics():
    # startTime in the API response is timezone-aware, so use an aware "now"
    now = datetime.now(timezone.utc)
    one_week_ago = now - timedelta(days=7)
    # list_pipeline_executions has no time-range parameters, so page through
    # the history and keep the executions started within the past week
    paginator = codepipeline.get_paginator('list_pipeline_executions')
    recent = [
        e
        for page in paginator.paginate(pipelineName='mfmsapp-pipeline')
        for e in page['pipelineExecutionSummaries']
        if e['startTime'] >= one_week_ago
    ]
    total = len(recent)
    succeeded = sum(1 for e in recent if e['status'] == 'Succeeded')
    failed = sum(1 for e in recent if e['status'] == 'Failed')
    success_rate = (succeeded / total * 100) if total > 0 else 0
    # With no executions, report a 0% failure rate rather than 100%
    failure_rate = (failed / total * 100) if total > 0 else 0
    # Publish the metrics
    cloudwatch.put_metric_data(
        Namespace='Custom/CICD',
        MetricData=[
            {
                'MetricName': 'DeploymentFrequency',
                'Value': total,
                'Unit': 'Count',
                'Dimensions': [{'Name': 'Pipeline', 'Value': 'mfmsapp'}]
            },
            {
                'MetricName': 'ChangeFailureRate',
                'Value': failure_rate,
                'Unit': 'Percent',
                'Dimensions': [{'Name': 'Pipeline', 'Value': 'mfmsapp'}]
            },
            {
                'MetricName': 'SuccessfulDeployments',
                'Value': succeeded,
                'Unit': 'Count',
                'Dimensions': [{'Name': 'Pipeline', 'Value': 'mfmsapp'}]
            }
        ]
    )
    print(f"DORA Metrics: Total={total}, Succeeded={succeeded}, Failed={failed}, SuccessRate={success_rate:.1f}%")
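The arithmetic behind these metrics can be checked without touching AWS. The sketch below assumes execution summaries shaped like entries of `pipelineExecutionSummaries` (a `status` string and a timezone-aware `startTime`); `summarize` and the sample data are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def summarize(executions, now, window=timedelta(days=7)):
    """Compute counts and rates for executions started within the window."""
    recent = [e for e in executions if e["startTime"] >= now - window]
    total = len(recent)
    succeeded = sum(1 for e in recent if e["status"] == "Succeeded")
    failed = sum(1 for e in recent if e["status"] == "Failed")
    # Guard against division by zero in a week with no executions.
    failure_rate = failed / total * 100 if total else 0.0
    return {
        "total": total,
        "succeeded": succeeded,
        "failed": failed,
        "failure_rate": failure_rate,
    }


now = datetime(2024, 1, 15, tzinfo=timezone.utc)  # fixed, made-up reference time
sample = [
    {"status": "Succeeded", "startTime": now - timedelta(days=1)},
    {"status": "Succeeded", "startTime": now - timedelta(days=2)},
    {"status": "Succeeded", "startTime": now - timedelta(days=3)},
    {"status": "Failed", "startTime": now - timedelta(days=4)},
    {"status": "Failed", "startTime": now - timedelta(days=30)},  # outside the window
]
print(summarize(sample, now))
```

Four executions fall inside the week, one of them failed, so the failure rate is 25%, which would exceed a 20% alert threshold.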
Configure the alert threshold:
# Alert when the success rate drops below 80% (failure rate above 20%)
aws cloudwatch put-metric-alarm \
  --alarm-name mfmsapp-deployment-success-rate-low \
  --namespace "Custom/CICD" \
  --metric-name ChangeFailureRate \
  --dimensions Name=Pipeline,Value=mfmsapp \
  --statistic Average \
  --period 86400 \
  --evaluation-periods 1 \
  --threshold 20 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:ap-northeast-1:123456789012:mfmsapp-cicd-alerts
Monitoring dashboard template
{
  "widgets": [
    {
      "type": "metric",
      "properties": {
        "title": "Pipeline Execution Status",
        "metrics": [
          ["AWS/CodePipeline", "PipelineExecutionSuccess", "PipelineName", "mfmsapp-pipeline"],
          [".", "PipelineExecutionFailure", ".", "."]
        ],
        "period": 300,
        "stat": "Sum",
        "region": "ap-northeast-1"
      }
    },
    {
      "type": "metric",
      "properties": {
        "title": "Build Duration",
        "metrics": [
          ["AWS/CodeBuild", "Duration", "ProjectName", "mfmsapp-build"]
        ],
        "period": 300,
        "stat": "Average",
        "region": "ap-northeast-1"
      }
    },
    {
      "type": "metric",
      "properties": {
        "title": "DORA - Deployment Frequency & Failure Rate",
        "metrics": [
          ["Custom/CICD", "DeploymentFrequency", "Pipeline", "mfmsapp"],
          [".", "ChangeFailureRate", ".", "."]
        ],
        "period": 86400,
        "stat": "Average",
        "region": "ap-northeast-1"
      }
    },
    {
      "type": "log",
      "properties": {
        "title": "Recent Errors",
        "query": "SOURCE '/aws/codebuild/mfmsapp' | fields @timestamp, @message | filter @message like /ERROR|FAIL/ | sort @timestamp desc | limit 20",
        "region": "ap-northeast-1",
        "stacked": false
      }
    }
  ]
}
aws cloudwatch put-dashboard \
  --dashboard-name mfmsapp-cicd \
  --dashboard-body file://dashboard.json
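Troubleshooting decision tree
Summary
CI/CD monitoring comes down to four capabilities:
See it: a unified log structure plus fast search with CloudWatch Logs Insights
Get notified: real-time notifications through EventBridge and SNS
Roll back: CodeDeploy automatic rollback plus health checks
Measure it: DORA metrics that track CI/CD health
Once all four are in place, your pipeline can be called production-ready.
Related documents
Next: Part 09: Advanced Techniques: Multi-Environment and Cross-Region Deployment. Parameter management, blue/green deployment, artifact replication, and parallel builds take your CI/CD from "usable" to "pleasant to use".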