AWS CI/CD in Practice, Part 08: Monitoring and Troubleshooting with CloudWatch Logs, SNS Alerts, Rollback Configuration, and Manual Retries

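Series note: In the previous part we locked down permissions. Security is only the baseline, though. What makes CI/CD usable in production is observability: when something breaks, you can see it, you get alerted, and you can roll back quickly. This part builds a CI/CD monitoring setup along four dimensions: logs, alerts, rollback, and retries.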

Monitoring architecture overview

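Pitfall 1: Logs are scattered and the error is hard to find

Symptom

The pipeline fails, and the console shows nothing but a red cross:

Pipeline execution failed at stage 'Build'

You have to open CodePipeline, then CodeBuild, then CloudWatch Logs, switching back and forth to find the error.

Solution: a unified log structure

CodeBuild log configuration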

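# buildspec.yml
version: 0.2

env:
  shell: bash   # needed for `set -o pipefail` below
  variables:
    LOG_GROUP: "/aws/codebuild/mfmsapp"
    LOG_STREAM_PREFIX: "build"

# Commands are written as block scalars (|) because an unquoted command
# containing ": " (for example `echo "... Commit: ..."`) is not valid YAML.
phases:
  pre_build:
    commands:
      - |
        echo "[START] $(date -u +%Y-%m-%dT%H:%M:%SZ) Build started"
        echo "[INFO]  Commit: $CODEBUILD_RESOLVED_SOURCE_VERSION"
        # CODEBUILD_WEBHOOK_HEAD_REF is only set for webhook-triggered builds
        echo "[INFO]  Branch: ${CODEBUILD_WEBHOOK_HEAD_REF:-n/a}"
  build:
    commands:
      - |
        # pipefail keeps the exit code of `go test`; without it, `tee` masks test failures
        set -eo pipefail
        echo "[BUILD]  Compiling Go binary..."
        # Assumes the main package is at the repo root; `-o <file>` fails when ./... matches several packages
        go build -v -o mfmsapp .
        echo "[BUILD]  Binary size: $(ls -lh mfmsapp | awk '{print $5}')"
        echo "[TEST]   Running tests..."
        go test -v ./... 2>&1 | tee test-output.txt
  post_build:
    commands:
      - |
        echo "[END]    $(date -u +%Y-%m-%dT%H:%M:%SZ) Build completed"
        if [ "$CODEBUILD_BUILD_SUCCEEDING" = "1" ]; then
          echo "[RESULT] BUILD SUCCESS"
        else
          echo "[RESULT] BUILD FAILED"
          echo "[DIAG]   Last 20 lines of output:"
          if [ -f test-output.txt ]; then tail -20 test-output.txt; fi
        fi

# The CloudWatch log group and stream prefix are set on the CodeBuild project
# (logsConfig.cloudWatchLogs: groupName /aws/codebuild/mfmsapp, streamName build),
# not in buildspec.yml; buildspec has no `logs` section.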

Once the logs are structured, CloudWatch Logs Insights can search them quickly:

fields @timestamp, @message
| filter @message like /ERROR|FAIL|WARN/
| sort @timestamp desc
| limit 50
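The same query can also be run from a script, for example as part of an on-call runbook. The following is a minimal sketch, assuming a client that behaves like `boto3.client("logs")`; the function name `run_insights_query` and its polling parameters are illustrative and not part of the original setup.

```python
import time

QUERY = (
    "fields @timestamp, @message "
    "| filter @message like /ERROR|FAIL|WARN/ "
    "| sort @timestamp desc "
    "| limit 50"
)


def run_insights_query(logs_client, log_group, query, start_ts, end_ts,
                       poll_seconds=1.0, max_polls=30):
    """Run a Logs Insights query and return the rows as dicts.

    logs_client is expected to behave like boto3.client("logs").
    start_ts and end_ts are epoch seconds.
    """
    query_id = logs_client.start_query(
        logGroupName=log_group,
        startTime=start_ts,
        endTime=end_ts,
        queryString=query,
    )["queryId"]

    # Insights queries are asynchronous, so poll until the query finishes
    for _ in range(max_polls):
        response = logs_client.get_query_results(queryId=query_id)
        status = response["status"]
        if status == "Complete":
            # Each row is a list of {"field": ..., "value": ...} pairs
            return [{cell["field"]: cell["value"] for cell in row}
                    for row in response["results"]]
        if status in ("Failed", "Cancelled", "Timeout"):
            raise RuntimeError(f"query ended with status {status}")
        time.sleep(poll_seconds)
    raise TimeoutError("query did not complete in time")
```

With boto3 installed and credentials configured, a call could look like `run_insights_query(boto3.client("logs"), "/aws/codebuild/mfmsapp", QUERY, now - 3600, now)`, where `now` is the current epoch time in seconds.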

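Pitfall 2: Nobody notices when the pipeline fails

Symptom

A deployment fails on Friday afternoon, and nobody finds out until users complain on Monday morning. That is a blind window of almost three days.

Solution: real-time alerts with EventBridge + SNS

Step 1: Create the SNS topic

aws sns create-topic --name mfmsapp-cicd-alerts

# Subscribe by email
aws sns subscribe \
  --topic-arn arn:aws:sns:ap-northeast-1:123456789012:mfmsapp-cicd-alerts \
  --protocol email \
  --notification-endpoint your-team@company.com

# Confirm the subscription (check the inbox)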

Step 2: Create the EventBridge rule

Save the following event pattern as event-pattern.json:

{
  "source": ["aws.codepipeline"],
  "detail-type": ["CodePipeline Pipeline Execution State Change"],
  "detail": {
    "pipeline": ["mfmsapp-pipeline"],
    "state": ["FAILED", "SUCCEEDED", "STARTED"]
  }
}
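Before creating the rule, it can help to check which events the pattern would match. The sketch below is a deliberately simplified matcher (exact values only; no prefix, numeric, or array-in-event handling), and the sample events are hypothetical, trimmed-down versions of what CodePipeline emits. For an authoritative check, the EventBridge API offers `test_event_pattern`.

```python
PATTERN = {
    "source": ["aws.codepipeline"],
    "detail-type": ["CodePipeline Pipeline Execution State Change"],
    "detail": {
        "pipeline": ["mfmsapp-pipeline"],
        "state": ["FAILED", "SUCCEEDED", "STARTED"],
    },
}


def matches(pattern, event):
    """Simplified EventBridge matching: every pattern key must be present
    in the event, and its value must be one of the listed values."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested pattern: recurse into the nested event object
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            # Leaf: expected is a list of allowed values
            return False
    return True


def sample_event(state):
    """Hypothetical, trimmed-down pipeline state change event."""
    return {
        "source": "aws.codepipeline",
        "detail-type": "CodePipeline Pipeline Execution State Change",
        "detail": {"pipeline": "mfmsapp-pipeline", "state": state},
    }


if __name__ == "__main__":
    for state in ("FAILED", "SUCCEEDED", "CANCELED"):
        print(state, matches(PATTERN, sample_event(state)))
```

Under this simplified model, a CANCELED execution does not match the pattern, so add "CANCELED" to the state list if cancellations should also be reported.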

aws events put-rule \
  --name mfmsapp-pipeline-monitor \
  --event-pattern file://event-pattern.json \
  --state ENABLED

# Attach the SNS topic as the rule target
# (the topic's access policy must allow events.amazonaws.com to publish)
aws events put-targets \
  --rule mfmsapp-pipeline-monitor \
  --targets Id=1,Arn=arn:aws:sns:ap-northeast-1:123456789012:mfmsapp-cicd-alerts
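EventBridge can only deliver to the topic if the topic's access policy allows the `events.amazonaws.com` service principal to call `sns:Publish`. The following sketch builds such a policy and applies it through a client that behaves like `boto3.client("sns")`. The helper names and the `Sid` are illustrative assumptions.

```python
import json


def build_topic_policy(topic_arn):
    """Access policy that lets EventBridge publish to one SNS topic."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "AllowEventBridgePublish",  # illustrative statement ID
            "Effect": "Allow",
            "Principal": {"Service": "events.amazonaws.com"},
            "Action": "sns:Publish",
            "Resource": topic_arn,
        }],
    }


def apply_topic_policy(sns_client, topic_arn):
    """Set the policy on the topic.

    Note: this replaces the topic's existing policy document as a whole,
    so merge in any statements you still need before applying it.
    """
    sns_client.set_topic_attributes(
        TopicArn=topic_arn,
        AttributeName="Policy",
        AttributeValue=json.dumps(build_topic_policy(topic_arn)),
    )
```

A call would look like `apply_topic_policy(boto3.client("sns"), "arn:aws:sns:ap-northeast-1:123456789012:mfmsapp-cicd-alerts")`.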

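Step 3: Customize the notification message (optional but recommended)

By default SNS delivers the raw event JSON, which is hard to read. A Lambda function can format it. Register the function as the rule's target in place of the SNS topic, otherwise every event is delivered twice: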

import boto3

sns = boto3.client('sns')
TOPIC_ARN = 'arn:aws:sns:ap-northeast-1:123456789012:mfmsapp-cicd-alerts'
# CodeBuild log group from buildspec.yml (it is not derived from the pipeline name)
LOG_GROUP = '/aws/codebuild/mfmsapp'

STATUS_EMOJI = {
    'STARTED': '🚀',
    'SUCCEEDED': '✅',
    'FAILED': '❌',
    'CANCELED': '⚠️'
}

def lambda_handler(event, context):
    detail = event['detail']
    pipeline = detail['pipeline']
    state = detail['state']
    execution_id = detail['execution-id']
    
    emoji = STATUS_EMOJI.get(state, '❓')
    
    message = f"""{emoji} CI/CD pipeline state change

Pipeline: {pipeline}
State: {state}
Execution ID: {execution_id}
Time: {event['time']}

Details: https://ap-northeast-1.console.aws.amazon.com/codesuite/codepipeline/pipelines/{pipeline}/view
"""
    
    if state == 'FAILED':
        message += f"""
⚡ Quick triage:
1. Open the link above to see which stage failed
2. Check CloudWatch Logs: {LOG_GROUP}
3. Check for recent code or configuration changes
"""
    
    # SNS subjects must be ASCII and shorter than 100 characters, so no emoji here
    sns.publish(
        TopicArn=TOPIC_ARN,
        Subject=f'[{pipeline}] {state}'[:99],
        Message=message
    )
    
    return {'statusCode': 200}

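Pitfall 3: No fast rollback when a deployment fails

Symptom

After v3 is deployed to EC2 the application crashes, and recovering by hand over SSH takes 10 minutes. Users see 500 errors the whole time.

Solution: CodeDeploy automatic rollback + health checks

Configure automatic rollback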

{
  "applicationName": "mfmsapp",
  "deploymentGroupName": "mfmsapp-prod",
  "deploymentConfigName": "CodeDeployDefault.OneAtATime",
  "autoRollbackConfiguration": {
    "enabled": true,
    "events": [
      "DEPLOYMENT_FAILURE",
      "DEPLOYMENT_STOP_ON_REQUEST",
      "DEPLOYMENT_STOP_ON_ALARM"
    ]
  },
  "alarmConfiguration": {
    "alarms": [
      {
        "name": "mfmsapp-http-5xx-high"
      }
    ],
    "enabled": true
  }
}
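One way to apply this configuration to an existing deployment group is the CodeDeploy `update_deployment_group` call. The sketch below assumes the JSON above is saved as `deployment-group.json` (a hypothetical file name) and that the client behaves like `boto3.client("codedeploy")`. The only translation needed is that the update call identifies the group through `currentDeploymentGroupName`.

```python
import json


def to_update_kwargs(config):
    """Map the deployment group JSON to update_deployment_group arguments."""
    kwargs = dict(config)
    # The update call names the target group via currentDeploymentGroupName
    kwargs["currentDeploymentGroupName"] = kwargs.pop("deploymentGroupName")
    return kwargs


def apply_rollback_config(codedeploy_client, path="deployment-group.json"):
    """Load the JSON file and push it to the existing deployment group."""
    with open(path) as f:
        config = json.load(f)
    codedeploy_client.update_deployment_group(**to_update_kwargs(config))
```

Keeping the settings in a file under version control means rollback behavior is reviewed like any other change instead of being edited in the console.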

Configure the health check script

Add the health check to the ValidateService lifecycle hook in appspec.yml:

# appspec.yml
version: 0.0
os: linux
files:
  - source: /mfmsapp
    destination: /opt/mfmsapp
hooks:
  BeforeInstall:
    - location: scripts/stop.sh
      timeout: 30
  AfterInstall:
    - location: scripts/start.sh
      timeout: 60
  ValidateService:
    - location: scripts/health_check.sh
      timeout: 120
      runas: root

#!/bin/bash
# scripts/health_check.sh

APP_URL="http://localhost:8080/health"
MAX_RETRIES=10
RETRY_INTERVAL=5

echo "[HEALTH] Starting health check for $APP_URL"

for i in $(seq 1 $MAX_RETRIES); do
    HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" --max-time 5 "$APP_URL" 2>/dev/null)
    
    if [ "$HTTP_CODE" = "200" ]; then
        echo "[HEALTH] ✅ Health check passed on attempt $i (HTTP $HTTP_CODE)"
        exit 0
    fi
    
    echo "[HEALTH] Attempt $i/$MAX_RETRIES: HTTP $HTTP_CODE (waiting ${RETRY_INTERVAL}s...)"
    sleep $RETRY_INTERVAL
done

echo "[HEALTH] ❌ Health check failed after $MAX_RETRIES attempts"
exit 1

Create the CloudWatch alarm

aws cloudwatch put-metric-alarm \
  --alarm-name mfmsapp-http-5xx-high \
  --namespace "AWS/ApplicationELB" \
  --metric-name HTTPCode_Target_5XX_Count \
  --dimensions Name=LoadBalancer,Value=app/mfmsapp-alb/xxxxx \
  --statistic Sum \
  --period 60 \
  --evaluation-periods 2 \
  --threshold 10 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:ap-northeast-1:123456789012:mfmsapp-cicd-alerts \
  --treat-missing-data notBreaching
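To make the alarm settings concrete: with a 60-second period, 2 evaluation periods, and a threshold of 10, the alarm fires only when two consecutive one-minute sums are both above 10, and minutes without data count as healthy. The function below is a simplified model of that rule for illustration only; CloudWatch's real evaluation of missing data is more involved.

```python
def alarm_state(datapoints, threshold=10, evaluation_periods=2):
    """Simplified model of the alarm above.

    datapoints: per-period 5xx sums, oldest first. None means the period
    had no data, which is treated as not breaching (notBreaching).
    """
    # Pad with missing periods so short histories are handled uniformly
    padded = [None] * evaluation_periods + list(datapoints)
    recent = padded[-evaluation_periods:]
    breaching = [dp is not None and dp > threshold for dp in recent]
    return "ALARM" if all(breaching) else "OK"


if __name__ == "__main__":
    print(alarm_state([3, 12, 15]))    # two breaching periods in a row
    print(alarm_state([12, 10]))       # 10 is not greater than the threshold
    print(alarm_state([12, None]))     # missing data counts as healthy
```

A single bad minute therefore does not trigger a rollback, which avoids rolling back on a brief spike at the cost of reacting about two minutes later.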

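Pitfall 4: A failed build cannot be retried easily

Symptom

A CodeBuild run fails for a transient reason (Docker Hub rate limiting, an npm registry timeout), but CodePipeline does not retry by default. You have to click "Release change" by hand and rerun the whole pipeline.

Solution

Option A: retry logic inside buildspec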

phases:
  build:
    commands:
      # retry() is defined and called in the same multi-line command, so it
      # does not rely on shell functions surviving between separate commands.
      - |
        retry() {
          n=0
          max=3
          # "$@" keeps the original argument boundaries intact
          until "$@"; do
            n=$((n+1))
            if [ "$n" -ge "$max" ]; then
              echo "[RETRY] ❌ All $max attempts failed for: $*"
              return 1
            fi
            echo "[RETRY] Attempt $n/$max failed. Retrying in 5s..."
            sleep 5
          done
        }
        retry go mod download || exit 1
        retry docker pull golang:1.22 || exit 1

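Option B: CodePipeline automatic stage retry

Add an onFailure block to the Build stage declaration (automatic stage retry requires a V2 pipeline):

"onFailure": {
  "result": "RETRY",
  "retryConfiguration": {
    "retryMode": "FAILED_ACTIONS"
  }
}

Apply it to the pipeline definition:

aws codepipeline get-pipeline --name mfmsapp-pipeline > pipeline.json

# Add the onFailure block to the Build stage in the stages array,
# and delete the top-level "metadata" key (update-pipeline rejects it).
# Then update:
aws codepipeline update-pipeline --cli-input-json file://pipeline.json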
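Editing pipeline.json by hand is error-prone, so the edit can be scripted. The sketch below assumes the stage-level `onFailure` / `retryConfiguration` shape of the CodePipeline API (V2 pipelines) and the file produced by `get-pipeline`. The function name and the output file name are illustrative.

```python
import copy
import json


def add_stage_retry(pipeline_doc, stage_name="Build", retry_mode="FAILED_ACTIONS"):
    """Return an update-pipeline input with automatic retry on one stage.

    pipeline_doc is the parsed output of `aws codepipeline get-pipeline`.
    """
    pipeline = copy.deepcopy(pipeline_doc["pipeline"])
    for stage in pipeline["stages"]:
        if stage["name"] == stage_name:
            stage["onFailure"] = {
                "result": "RETRY",
                "retryConfiguration": {"retryMode": retry_mode},
            }
            break
    else:
        raise ValueError(f"stage {stage_name!r} not found")
    # get-pipeline also returns "metadata", which update-pipeline does not
    # accept, so only the "pipeline" key is kept
    return {"pipeline": pipeline}


def patch_file(src="pipeline.json", dst="pipeline-updated.json"):
    """Read the get-pipeline output and write the patched definition."""
    with open(src) as f:
        doc = json.load(f)
    with open(dst, "w") as f:
        json.dump(add_stage_retry(doc), f, indent=2)
```

After running `patch_file()`, pass the new file to `aws codepipeline update-pipeline --cli-input-json file://pipeline-updated.json`.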

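Option C: automatic retry with Lambda

import boto3

codepipeline = boto3.client('codepipeline')
MAX_RETRIES = 1  # automatic retries allowed per failed action

def lambda_handler(event, context):
    # EventBridge delivers the FAILED execution event (the rule should match FAILED only)
    detail = event['detail']
    pipeline_name = detail['pipeline']
    execution_id = detail['execution-id']
    
    # Action history for this execution. A stage retry keeps the same execution ID,
    # so repeated failures of one action show up as several Failed entries.
    # (First page only; follow nextToken for executions with many actions.)
    history = codepipeline.list_action_executions(
        pipelineName=pipeline_name,
        filter={'pipelineExecutionId': execution_id}
    )
    failed = [a for a in history['actionExecutionDetails'] if a['status'] == 'Failed']
    
    if not failed:
        print(f"No failed action found. Execution: {execution_id}")
        return
    
    # The most recent failure identifies the stage to retry
    latest = max(failed, key=lambda a: a['startTime'])
    stage_name = latest['stageName']
    
    # Guard against an endless retry loop
    attempts = sum(
        1 for a in failed
        if a['stageName'] == stage_name and a['actionName'] == latest['actionName']
    )
    if attempts > MAX_RETRIES:
        print(f"Retry limit reached for {stage_name}, skipping. Execution: {execution_id}")
        return
    
    print(f"Retrying failed stage: {stage_name}")
    
    # Retry only the failed actions of that stage
    codepipeline.retry_stage_execution(
        pipelineName=pipeline_name,
        stageName=stage_name,
        pipelineExecutionId=execution_id,
        retryMode='FAILED_ACTIONS'
    )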

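Pitfall 5: Incomplete monitoring metrics

Symptom

You only monitor whether the pipeline succeeded, and miss:

Whether build time keeps growing (dependency bloat, cache misses)
Whether deployment frequency is dropping (the team is afraid to deploy)
Whether the failure rate is rising (declining code quality)

Solution: DORA metrics

Track CI/CD health with CloudWatch Metrics and custom metrics: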

import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client('cloudwatch')
codepipeline = boto3.client('codepipeline')

def publish_cicd_metrics():
    # boto3 returns timezone-aware timestamps, so compare against an aware datetime
    one_week_ago = datetime.now(timezone.utc) - timedelta(days=7)
    
    # list_pipeline_executions has no time-range filter, so page through the
    # results (newest first) and stop at the first execution older than a week
    summaries = []
    paginator = codepipeline.get_paginator('list_pipeline_executions')
    for page in paginator.paginate(pipelineName='mfmsapp-pipeline'):
        batch = page['pipelineExecutionSummaries']
        recent = [e for e in batch if e['startTime'] >= one_week_ago]
        summaries.extend(recent)
        if len(recent) < len(batch):
            break
    
    total = len(summaries)
    succeeded = sum(1 for e in summaries if e['status'] == 'Succeeded')
    failed = sum(1 for e in summaries if e['status'] == 'Failed')
    
    # Rates use finished executions only, so in-progress runs and an empty
    # week are not reported as failures
    finished = succeeded + failed
    success_rate = (succeeded / finished * 100) if finished > 0 else 100.0
    failure_rate = (failed / finished * 100) if finished > 0 else 0.0
    
    # Publish the metrics
    cloudwatch.put_metric_data(
        Namespace='Custom/CICD',
        MetricData=[
            {
                'MetricName': 'DeploymentFrequency',
                'Value': total,
                'Unit': 'Count',
                'Dimensions': [{'Name': 'Pipeline', 'Value': 'mfmsapp'}]
            },
            {
                'MetricName': 'ChangeFailureRate',
                'Value': failure_rate,
                'Unit': 'Percent',
                'Dimensions': [{'Name': 'Pipeline', 'Value': 'mfmsapp'}]
            },
            {
                'MetricName': 'SuccessfulDeployments',
                'Value': succeeded,
                'Unit': 'Count',
                'Dimensions': [{'Name': 'Pipeline', 'Value': 'mfmsapp'}]
            }
        ]
    )
    
    print(f"DORA Metrics: Total={total}, Succeeded={succeeded}, Failed={failed}, SuccessRate={success_rate:.1f}%")
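The first symptom in the list, build time creeping up, can be tracked with a simple trend calculation. The sketch below is an illustration and not part of the original script: it compares the mean of the newest builds with the mean of the builds just before them. Durations in seconds could come from the `AWS/CodeBuild` `Duration` metric or from build start and end times. The function name and the default window size are assumptions.

```python
from statistics import mean


def duration_trend(durations, window=5):
    """Percent change of the newest `window` builds versus the previous window.

    durations: build durations in seconds, oldest first.
    Returns None when there is not enough history to compare.
    """
    if len(durations) < 2 * window:
        return None
    recent = mean(durations[-window:])
    baseline = mean(durations[-2 * window:-window])
    if baseline == 0:
        return None
    return (recent - baseline) / baseline * 100


if __name__ == "__main__":
    history = [100, 100, 100, 100, 100, 150, 150, 150, 150, 150]
    print(f"Build duration change: {duration_trend(history):+.1f}%")
```

The result could be published as another `Custom/CICD` metric and alarmed on, for example when builds become more than 30% slower than the previous window.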

Configure the alarm threshold:

# Alarm when the success rate drops below 80% (failure rate above 20%)
aws cloudwatch put-metric-alarm \
  --alarm-name mfmsapp-deployment-success-rate-low \
  --namespace "Custom/CICD" \
  --metric-name ChangeFailureRate \
  --dimensions Name=Pipeline,Value=mfmsapp \
  --statistic Average \
  --period 86400 \
  --evaluation-periods 1 \
  --threshold 20 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:ap-northeast-1:123456789012:mfmsapp-cicd-alerts

Monitoring dashboard template

{
  "widgets": [
    {
      "type": "metric",
      "properties": {
        "title": "Pipeline Execution Status",
        "metrics": [
          ["AWS/CodePipeline", "PipelineExecutionSuccess", "PipelineName", "mfmsapp-pipeline"],
          [".", "PipelineExecutionFailure", ".", "."]
        ],
        "region": "ap-northeast-1",
        "period": 300,
        "stat": "Sum"
      }
    },
    {
      "type": "metric",
      "properties": {
        "title": "Build Duration",
        "metrics": [
          ["AWS/CodeBuild", "Duration", "ProjectName", "mfmsapp-build"]
        ],
        "region": "ap-northeast-1",
        "period": 300,
        "stat": "Average"
      }
    },
    {
      "type": "metric",
      "properties": {
        "title": "DORA - Deployment Frequency & Failure Rate",
        "metrics": [
          ["Custom/CICD", "DeploymentFrequency", "Pipeline", "mfmsapp"],
          [".", "ChangeFailureRate", ".", "."]
        ],
        "region": "ap-northeast-1",
        "period": 86400,
        "stat": "Average"
      }
    },
    {
      "type": "log",
      "properties": {
        "title": "Recent Errors",
        "query": "SOURCE '/aws/codebuild/mfmsapp' | fields @timestamp, @message | filter @message like /ERROR|FAIL/ | sort @timestamp desc | limit 20",
        "region": "ap-northeast-1",
        "stacked": false
      }
    }
  ]
}

aws cloudwatch put-dashboard \
  --dashboard-name mfmsapp-cicd \
  --dashboard-body file://dashboard.json

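Troubleshooting decision tree

Symptom	Check first	Common causes	Quick fix
Pipeline shows a red cross	Which stage failed	-	Click the stage to view its logs
Source stage fails	CodeCommit/S3 connection	Expired permissions, missing repository	Check the EventBridge rule and IAM
Build stage fails	CodeBuild logs	Compile errors, dependency download failures	Retry or fix buildspec
Deploy stage fails	CodeDeploy logs	Failed health check, script timeout	Check appspec.yml and the scripts
Build times out	CodeBuild Duration	Dependency bloat, cache misses	Configure an S3 cache or raise the timeout
5xx spike after deployment	CloudWatch Alarm	Application bug, misconfiguration	Automatic or manual rollback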

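Summary

CI/CD monitoring comes down to four capabilities:

See: a unified log structure + fast search with CloudWatch Insights
Get notified: real-time notifications through EventBridge + SNS
Roll back: CodeDeploy automatic rollback + health checks
Measure: DORA metrics to track CI/CD health

With these four in place, your pipeline can be called production-ready.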
