AWS CI/CD in Practice 08: Monitoring and Troubleshooting (CloudWatch Logs, SNS Alerts, Rollback Configuration, and Manual Retries)

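Series guide: In the previous article we locked down permissions and security. But security is only the baseline. What makes CI/CD usable in production is observability: when something breaks, you can see it, get alerted, and roll back quickly. This article builds a complete CI/CD monitoring setup along four dimensions: logs, alerts, rollback, and retries.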


Monitoring Architecture Overview

graph TB
    A[CodePipeline] -->|state change| B[EventBridge]
    B -->|success/failure| C[SNS Topic]
    C -->|notify| D[Email / Slack / DingTalk]
    C -->|auto-trigger| E[Lambda rollback function]
    A -->|execution logs| F[CloudWatch Logs]
    F -->|metric filter| G[CloudWatch Metrics]
    G -->|threshold alarm| H[CloudWatch Alarm]
    H -->|trigger| C
    I[CodeBuild] -->|build logs| F
    J[CodeDeploy] -->|deployment logs| F
    J -->|lifecycle events| K[EC2 instance logs]
    style A fill:#FF9900,stroke:#232F3E,color:#fff
    style B fill:#FF4F8B,stroke:#232F3E,color:#fff
    style C fill:#3F8624,stroke:#232F3E,color:#fff
    style F fill:#1A73E8,stroke:#232F3E,color:#fff

Pitfall 1: Logs Are Scattered and the Error Is Hard to Find

Symptom

The pipeline fails, and the console shows nothing but a red X:

Pipeline execution failed at stage 'Build'

You have to open CodePipeline, then CodeBuild, then CloudWatch Logs, switching back and forth to find the error.

Solution: A Unified Log Structure

CodeBuild Logging Configuration

# buildspec.yml
version: 0.2

env:
  variables:
    LOG_GROUP: "/aws/codebuild/mfmsapp"
    LOG_STREAM_PREFIX: "build"

phases:
  pre_build:
    commands:
      - echo "[START] $(date -u +%Y-%m-%dT%H:%M:%SZ) Build started"
      - echo "[INFO]  Commit: $CODEBUILD_RESOLVED_SOURCE_VERSION"
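      # CODEBUILD_WEBHOOK_HEAD_REF is only set for webhook-triggered builds; it is empty when CodePipeline starts the build
      - echo "[INFO]  Branch: $CODEBUILD_WEBHOOK_HEAD_REF"
  build:
    commands:
      - echo "[BUILD]  Compiling Go binary..."
      # Assumes the main package is in the repository root; -o with a file name cannot be combined with multiple packages (./...)
      - go build -v -o mfmsapp .
      - echo "[BUILD]  Binary size: $(ls -lh mfmsapp | awk '{print $5}')"
      - echo "[TEST]   Running tests..."
      # Capture the exit code of go test itself; piping through tee would report tee's status and hide test failures
      - |
        go test -v ./... > test-output.txt 2>&1
        TEST_EXIT=$?
        cat test-output.txt
        echo "[TEST]   Exit code: $TEST_EXIT"
        [ "$TEST_EXIT" -eq 0 ]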
  post_build:
    commands:
      - echo "[END]    $(date -u +%Y-%m-%dT%H:%M:%SZ) Build completed"
      - |
        if [ "$CODEBUILD_BUILD_SUCCEEDING" = "1" ]; then
          echo "[RESULT] BUILD SUCCESS"
        else
          echo "[RESULT] BUILD FAILED"
          echo "[DIAG]   Last 20 lines of output:"
          tail -20 test-output.txt
        fi

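# Note: buildspec.yml has no "logs" section. The log group and stream name are set on the
# CodeBuild project itself (logsConfig.cloudWatchLogs: groupName and streamName),
# for example groupName /aws/codebuild/mfmsapp and streamName build.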

Once the logs are structured, search them quickly with CloudWatch Logs Insights:

fields @timestamp, @message
| filter @message like /ERROR|FAIL|WARN/
| sort @timestamp desc
| limit 50
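
To see which of the tagged lines a filter like this keeps, here is a minimal Python sketch that applies the same regular expression locally. The sample lines imitate the `[TAG] message` format from the buildspec, and `matching_lines` is a hypothetical helper for illustration, not part of any AWS SDK.

```python
import re

# Same alternation as the Logs Insights filter: @message like /ERROR|FAIL|WARN/
PATTERN = re.compile(r"ERROR|FAIL|WARN")

def matching_lines(lines):
    """Return the log lines the filter would keep, preserving their order."""
    return [line for line in lines if PATTERN.search(line)]

# Made-up sample lines in the buildspec's log format
sample = [
    "[START] 2024-01-01T00:00:00Z Build started",
    "[BUILD]  Compiling Go binary...",
    "--- FAIL: TestLogin (0.01s)",
    "[RESULT] BUILD FAILED",
]

for line in matching_lines(sample):
    print(line)
```

The match is case-sensitive, so a tag such as `[RESULT] BUILD FAILED` is found while a lowercase "failed" in free-form output would not be. Keeping the tags uppercase is what makes the query reliable.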

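Pitfall 2: Nobody Notices When the Pipeline Fails

Symptom

A deployment fails on Friday afternoon, and nobody finds out until users report problems on Monday morning. That is a blind spot of nearly three days.

Solution: Real-Time Alerts with EventBridge + SNS

Step 1: Create the SNS Topic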

aws sns create-topic --name mfmsapp-cicd-alerts

# Subscribe an email address
aws sns subscribe \
  --topic-arn arn:aws:sns:ap-northeast-1:123456789012:mfmsapp-cicd-alerts \
  --protocol email \
  --notification-endpoint your-team@company.com

# Confirm the subscription (check your inbox)

Step 2: Create the EventBridge Rule

Save the following event pattern as event-pattern.json:

{
  "source": ["aws.codepipeline"],
  "detail-type": ["CodePipeline Pipeline Execution State Change"],
  "detail": {
    "pipeline": ["mfmsapp-pipeline"],
    "state": ["FAILED", "SUCCEEDED", "STARTED"]
  }
}

Create the rule from that file:

aws events put-rule \
  --name mfmsapp-pipeline-monitor \
  --event-pattern file://event-pattern.json \
  --state ENABLED

# Attach the SNS topic as a target (the topic's access policy must allow events.amazonaws.com to publish)
aws events put-targets \
  --rule mfmsapp-pipeline-monitor \
  --targets Id=1,Arn=arn:aws:sns:ap-northeast-1:123456789012:mfmsapp-cicd-alerts
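
To make the matching behaviour of the event pattern concrete, here is a small Python sketch of a matcher. It covers only the subset used above (nested objects plus lists of allowed exact values); real EventBridge patterns support more operators. The `matches` and `sample_event` helpers and the execution ID are hypothetical.

```python
def matches(pattern, event):
    """Return True if every field named in the pattern matches the event.

    A list in the pattern means "any one of these exact values";
    a dict means "descend into the nested object".
    """
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True

PATTERN = {
    "source": ["aws.codepipeline"],
    "detail-type": ["CodePipeline Pipeline Execution State Change"],
    "detail": {
        "pipeline": ["mfmsapp-pipeline"],
        "state": ["FAILED", "SUCCEEDED", "STARTED"],
    },
}

def sample_event(state):
    """Build a trimmed-down state-change event for the given state."""
    return {
        "source": "aws.codepipeline",
        "detail-type": "CodePipeline Pipeline Execution State Change",
        "detail": {
            "pipeline": "mfmsapp-pipeline",
            "state": state,
            "execution-id": "hypothetical-execution-id",
        },
    }

print(matches(PATTERN, sample_event("FAILED")))    # True
print(matches(PATTERN, sample_event("CANCELED")))  # False
```

Fields that the pattern does not mention, such as `execution-id`, are ignored. A state missing from the list, such as `CANCELED`, never reaches the topic, so add it to the pattern if you want to be told about canceled runs too.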

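Step 3: Customize the Notification Message (Optional but Recommended)

The default SNS message is raw JSON, which is hard to read. Use a Lambda function to format it. Point the EventBridge rule at this function instead of at the topic; otherwise subscribers receive both the raw and the formatted message.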

import boto3

sns = boto3.client('sns')
TOPIC_ARN = 'arn:aws:sns:ap-northeast-1:123456789012:mfmsapp-cicd-alerts'
# Log group of the CodeBuild project (not the same as the pipeline name)
LOG_GROUP = '/aws/codebuild/mfmsapp'

STATUS_EMOJI = {
    'STARTED': '🚀',
    'SUCCEEDED': '✅',
    'FAILED': '❌',
    'CANCELED': '⚠️'
}

def lambda_handler(event, context):
    detail = event['detail']
    pipeline = detail['pipeline']
    state = detail['state']
    execution_id = detail['execution-id']

    emoji = STATUS_EMOJI.get(state, '❓')

    message = f"""{emoji} CI/CD pipeline state change

Pipeline: {pipeline}
State: {state}
Execution ID: {execution_id}
Time: {event['time']}

Details: https://ap-northeast-1.console.aws.amazon.com/codesuite/codepipeline/pipelines/{pipeline}/view
"""

    if state == 'FAILED':
        message += f"""
⚡ Quick triage:
1. Open the link above to see which stage failed
2. Check CloudWatch Logs: {LOG_GROUP}
3. Check for recent code or configuration changes
"""

    sns.publish(
        TopicArn=TOPIC_ARN,
        # SNS subjects must be ASCII and shorter than 100 characters,
        # so the emoji stays in the message body only
        Subject=f'[{pipeline}] {state}'[:99],
        Message=message
    )

    return {'statusCode': 200}
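
The formatting logic is easier to verify when it is separated from the SNS call. The following is a minimal sketch of that split: `format_notification` is a hypothetical pure function, and the event below is a hand-written stand-in with a made-up execution ID and timestamp. It also enforces the SNS rule that a subject must be ASCII and shorter than 100 characters.

```python
def format_notification(event):
    """Build a (subject, message) pair from a pipeline state-change event."""
    detail = event["detail"]
    pipeline, state = detail["pipeline"], detail["state"]

    # Drop any non-ASCII characters and cap the length to satisfy SNS
    subject = f"[{pipeline}] {state}".encode("ascii", "ignore").decode()[:99]

    message = "\n".join([
        f"Pipeline: {pipeline}",
        f"State: {state}",
        f"Execution ID: {detail['execution-id']}",
        f"Time: {event['time']}",
    ])
    return subject, message

# Hand-written sample event; the ID and time are placeholders
event = {
    "time": "2024-01-01T00:00:00Z",
    "detail": {
        "pipeline": "mfmsapp-pipeline",
        "state": "FAILED",
        "execution-id": "hypothetical-execution-id",
    },
}

subject, message = format_notification(event)
print(subject)
print(message)
```

With this split, the Lambda handler only needs to call `format_notification(event)` and pass the two values to `sns.publish`, and the formatting can be unit-tested without AWS credentials.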

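Pitfall 3: No Fast Rollback After a Failed Deployment

Symptom

After v3 is deployed to EC2 the application crashes, and recovering by hand over SSH takes 10 minutes. Until the rollback finishes, users see 500 errors.

Solution: CodeDeploy Automatic Rollback + Health Checks

Configure Automatic Rollback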

{
  "applicationName": "mfmsapp",
  "deploymentGroupName": "mfmsapp-prod",
  "deploymentConfigName": "CodeDeployDefault.OneAtATime",
  "autoRollbackConfiguration": {
    "enabled": true,
    "events": [
      "DEPLOYMENT_FAILURE",
      "DEPLOYMENT_STOP_ON_REQUEST",
      "DEPLOYMENT_STOP_ON_ALARM"
    ]
  },
  "alarmConfiguration": {
    "alarms": [
      {
        "name": "mfmsapp-http-5xx-high"
      }
    ],
    "enabled": true
  }
}

Configure the Health Check Script

Add a health check to the ValidateService lifecycle hook in appspec.yml:

# appspec.yml
version: 0.0
os: linux
files:
  - source: /mfmsapp
    destination: /opt/mfmsapp
hooks:
  BeforeInstall:
    - location: scripts/stop.sh
      timeout: 30
  AfterInstall:
    - location: scripts/start.sh
      timeout: 60
  ValidateService:
    - location: scripts/health_check.sh
      timeout: 120
      runas: root

#!/bin/bash
# scripts/health_check.sh

APP_URL="http://localhost:8080/health"
MAX_RETRIES=10
RETRY_INTERVAL=5

echo "[HEALTH] Starting health check for $APP_URL"

for i in $(seq 1 $MAX_RETRIES); do
    # --max-time keeps a hung connection from eating the hook's 120s timeout
    HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" --max-time 5 "$APP_URL" 2>/dev/null)
    
    if [ "$HTTP_CODE" = "200" ]; then
        echo "[HEALTH] ✅ Health check passed on attempt $i (HTTP $HTTP_CODE)"
        exit 0
    fi
    
    echo "[HEALTH] Attempt $i/$MAX_RETRIES: HTTP $HTTP_CODE (waiting ${RETRY_INTERVAL}s...)"
    sleep $RETRY_INTERVAL
done

echo "[HEALTH] ❌ Health check failed after $MAX_RETRIES attempts"
exit 1
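
The polling loop in the script can be modelled in a few lines of Python, which makes its timing easy to reason about. This is a sketch under assumptions: `wait_until_healthy` is a hypothetical helper, and the probe and sleep functions are injected so the loop can run without a real service. Unlike the shell script, it does not sleep after the final failed attempt.

```python
import time

def wait_until_healthy(probe, max_retries=10, retry_interval=5, sleep=time.sleep):
    """Call probe() until it returns 200 or the attempts run out.

    Returns the 1-based attempt number that succeeded, or None on failure.
    """
    for attempt in range(1, max_retries + 1):
        if probe() == 200:
            return attempt
        if attempt < max_retries:
            sleep(retry_interval)
    return None

# Simulated service: two failed probes (curl would report 000), then HTTP 200
responses = iter([0, 0, 200])
waited = []  # records each sleep instead of actually sleeping
result = wait_until_healthy(lambda: next(responses), sleep=waited.append)
print(result, sum(waited))  # 3 10
```

With the script's values the worst case is 10 attempts with 5 seconds between them, roughly 50 seconds of waiting plus the time each request takes, which is why the ValidateService hook timeout of 120 seconds leaves enough headroom.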

Create the CloudWatch Alarm

aws cloudwatch put-metric-alarm \
  --alarm-name mfmsapp-http-5xx-high \
  --namespace "AWS/ApplicationELB" \
  --metric-name HTTPCode_Target_5XX_Count \
  --dimensions Name=LoadBalancer,Value=app/mfmsapp-alb/xxxxx \
  --statistic Sum \
  --period 60 \
  --evaluation-periods 2 \
  --threshold 10 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:ap-northeast-1:123456789012:mfmsapp-cicd-alerts \
  --treat-missing-data notBreaching
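
The alarm parameters read more clearly with a worked example. The sketch below is a deliberately simplified model of the configuration above: two consecutive one-minute periods, Sum greater than 10, missing data treated as not breaching. `alarm_state` is a hypothetical function, and real CloudWatch evaluation has more nuances than this.

```python
def alarm_state(datapoints, threshold=10, evaluation_periods=2):
    """Evaluate the most recent periods the way the alarm above is configured.

    datapoints: per-minute 5xx sums, oldest first. None means "no data",
    which counts as not breaching (--treat-missing-data notBreaching).
    """
    # Pad with missing datapoints so short histories can still be evaluated
    padded = [None] * evaluation_periods + list(datapoints)
    recent = padded[-evaluation_periods:]
    breaching = [dp is not None and dp > threshold for dp in recent]
    return "ALARM" if all(breaching) else "OK"

print(alarm_state([0, 12, 15]))  # ALARM: the last two minutes both exceed 10
print(alarm_state([12, None]))   # OK: the missing minute is not breaching
print(alarm_state([15, 10]))     # OK: 10 is not greater than the threshold
```

Because the comparison is strictly "greater than", exactly 10 errors in a minute does not count, and a single bad minute is not enough. CodeDeploy only stops and rolls back the deployment once the alarm has actually entered the ALARM state.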

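Pitfall 4: No Way to Retry a Failed Build

Symptom

A CodeBuild build fails for a transient reason (Docker Hub rate limiting, an npm registry timeout), but by default CodePipeline does not retry automatically. You have to click "Release change" and rerun the entire pipeline.

Solutions

Option A: Retry Logic Inside the buildspec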

phases:
  build:
    commands:
      # Define and use retry in the same command block, so it does not
      # depend on shell functions surviving from one command to the next
      - |
        retry() {
          max=3
          n=1
          # "$@" keeps the arguments intact instead of re-splitting a string
          until "$@"; do
            if [ "$n" -ge "$max" ]; then
              echo "[RETRY] ❌ All $max attempts failed for: $*"
              return 1
            fi
            echo "[RETRY] Attempt $n/$max failed. Retrying in 5s..."
            n=$((n+1))
            sleep 5
          done
        }
        retry go mod download && retry docker pull golang:1.22

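Option B: CodePipeline Automatic Retry Configuration

Automatic retry is declared in the onFailure block of the stage declaration. Only the retry-related fields are shown here; the stage's existing actions array stays unchanged. The stage declaration has no maxRetries field.

{
  "name": "Build",
  "onFailure": {
    "result": "RETRY",
    "retryConfiguration": {
      "retryMode": "FAILED_ACTIONS"
    }
  }
}

Configure it in the pipeline definition:

aws codepipeline get-pipeline --name mfmsapp-pipeline > pipeline.json

# Add the onFailure block to the Build stage in the stages array,
# remove the top-level "metadata" object that get-pipeline returns,
# then update:
aws codepipeline update-pipeline --cli-input-json file://pipeline.json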

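Option C: Automatic Retry with Lambda

import boto3

codepipeline = boto3.client('codepipeline')

def lambda_handler(event, context):
    # EventBridge delivers the pipeline-level FAILED event
    detail = event['detail']
    pipeline_name = detail['pipeline']
    execution_id = detail['execution-id']

    # Find the stage that failed in this execution. Stage states come from
    # get_pipeline_state; get_pipeline_execution does not return them.
    state = codepipeline.get_pipeline_state(name=pipeline_name)
    failed_stage = next(
        (s for s in state.get('stageStates', [])
         if s.get('latestExecution', {}).get('status') == 'Failed'
         and s['latestExecution'].get('pipelineExecutionId') == execution_id),
        None
    )
    if not failed_stage:
        print(f"No failed stage found for execution {execution_id}")
        return
    stage_name = failed_stage['stageName']

    # Guard against infinite loops. A retry keeps the same execution ID, so
    # inspect the action history: more failures than distinct failed actions
    # means this stage was already retried. (Assumes each retry is recorded
    # as an additional action execution.)
    history = codepipeline.list_action_executions(
        pipelineName=pipeline_name,
        filter={'pipelineExecutionId': execution_id}
    )
    failures = [a for a in history['actionExecutionDetails']
                if a['stageName'] == stage_name and a['status'] == 'Failed']
    if len(failures) > len({a['actionName'] for a in failures}):
        print(f"Already retried, skipping. Execution: {execution_id}")
        return

    # Retry only the failed actions of that stage
    print(f"Retrying failed stage: {stage_name}")
    codepipeline.retry_stage_execution(
        pipelineName=pipeline_name,
        stageName=stage_name,
        pipelineExecutionId=execution_id,
        retryMode='FAILED_ACTIONS'
    )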
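
The loop guard is the part of such a function most worth testing in isolation, because a mistake there means endless retries. Here is a minimal sketch of one possible guard, assuming that a retried action shows up as an additional entry in the action execution history for the same pipeline execution. `should_retry` and the sample records are hypothetical.

```python
def should_retry(action_executions, stage_name):
    """Decide whether a failed stage deserves one automatic retry.

    action_executions: dicts with 'stageName', 'actionName' and 'status',
    shaped like entries of the action execution history.
    """
    failures = [
        a for a in action_executions
        if a["stageName"] == stage_name and a["status"] == "Failed"
    ]
    failed_actions = {a["actionName"] for a in failures}
    # More failures than distinct failed actions: something already failed twice
    return bool(failures) and len(failures) == len(failed_actions)

first_failure = [
    {"stageName": "Source", "actionName": "Checkout", "status": "Succeeded"},
    {"stageName": "Build", "actionName": "CodeBuild", "status": "Failed"},
]
after_retry = first_failure + [
    {"stageName": "Build", "actionName": "CodeBuild", "status": "Failed"},
]

print(should_retry(first_failure, "Build"))  # True
print(should_retry(after_retry, "Build"))    # False
```

This caps automatic retries at one per action. If you want more attempts, compare the failure count against a limit instead of against the number of distinct actions.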

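Pitfall 5: Incomplete Monitoring Metrics

Symptom

You only monitor whether the pipeline succeeded, and overlook:

  • Whether build times keep growing (dependency bloat, cache misses)

  • Whether deployment frequency is dropping (the team is afraid to deploy)

  • Whether the failure rate is rising (declining code quality)

Solution: Monitoring DORA Metrics

Track CI/CD health with CloudWatch Metrics and custom metrics: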

import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client('cloudwatch')
codepipeline = boto3.client('codepipeline')

def publish_cicd_metrics():
    # startTime in the API response is timezone-aware, so use an aware "now"
    now = datetime.now(timezone.utc)
    one_week_ago = now - timedelta(days=7)

    # list_pipeline_executions has no time-range parameters, so page through
    # the history and keep the executions started during the past week
    paginator = codepipeline.get_paginator('list_pipeline_executions')
    recent = [
        e
        for page in paginator.paginate(pipelineName='mfmsapp-pipeline')
        for e in page['pipelineExecutionSummaries']
        if e['startTime'] >= one_week_ago
    ]

    total = len(recent)
    succeeded = sum(1 for e in recent if e['status'] == 'Succeeded')
    failed = sum(1 for e in recent if e['status'] == 'Failed')

    success_rate = (succeeded / total * 100) if total > 0 else 0
    # Count only real failures, and report 0% instead of 100% for an idle week
    failure_rate = (failed / total * 100) if total > 0 else 0

    # Publish the metrics
    cloudwatch.put_metric_data(
        Namespace='Custom/CICD',
        MetricData=[
            {
                'MetricName': 'DeploymentFrequency',
                'Value': total,
                'Unit': 'Count',
                'Dimensions': [{'Name': 'Pipeline', 'Value': 'mfmsapp'}]
            },
            {
                'MetricName': 'ChangeFailureRate',
                'Value': failure_rate,
                'Unit': 'Percent',
                'Dimensions': [{'Name': 'Pipeline', 'Value': 'mfmsapp'}]
            },
            {
                'MetricName': 'SuccessfulDeployments',
                'Value': succeeded,
                'Unit': 'Count',
                'Dimensions': [{'Name': 'Pipeline', 'Value': 'mfmsapp'}]
            }
        ]
    )

    print(f"DORA Metrics: Total={total}, Succeeded={succeeded}, Failed={failed}, SuccessRate={success_rate:.1f}%")
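
The arithmetic behind these metrics can be checked without touching AWS. The sketch below is a hypothetical pure function, `summarize`, working on hand-written execution summaries that carry only the two fields the calculation needs. It counts only executions inside the 7-day window and reports a failure rate of 0 when nothing ran.

```python
from datetime import datetime, timedelta, timezone

def summarize(executions, now, window=timedelta(days=7)):
    """Compute weekly counts and the failure rate from execution summaries."""
    recent = [e for e in executions if e["startTime"] >= now - window]
    total = len(recent)
    succeeded = sum(1 for e in recent if e["status"] == "Succeeded")
    failed = sum(1 for e in recent if e["status"] == "Failed")
    failure_rate = failed / total * 100 if total else 0.0
    return {
        "total": total,
        "succeeded": succeeded,
        "failed": failed,
        "failure_rate": failure_rate,
    }

# Fixed reference time and made-up executions, for a reproducible example
now = datetime(2024, 1, 8, tzinfo=timezone.utc)
executions = [
    {"status": "Succeeded", "startTime": now - timedelta(days=1)},
    {"status": "Failed", "startTime": now - timedelta(days=2)},
    {"status": "Succeeded", "startTime": now - timedelta(days=3)},
    {"status": "Succeeded", "startTime": now - timedelta(days=4)},
    {"status": "Succeeded", "startTime": now - timedelta(days=30)},  # outside the window
]

print(summarize(executions, now))
# {'total': 4, 'succeeded': 3, 'failed': 1, 'failure_rate': 25.0}
```

Here one failure out of four runs gives a 25% failure rate, which would exceed a 20% alarm threshold. Note that this treats every pipeline execution as a deployment; if some runs never reach the Deploy stage, the numbers describe the pipeline rather than production changes.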

Configure the alarm threshold:

# Alarm when the success rate drops below 80% (failure rate above 20%)
aws cloudwatch put-metric-alarm \
  --alarm-name mfmsapp-deployment-success-rate-low \
  --namespace "Custom/CICD" \
  --metric-name ChangeFailureRate \
  --dimensions Name=Pipeline,Value=mfmsapp \
  --statistic Average \
  --period 86400 \
  --evaluation-periods 1 \
  --threshold 20 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:ap-northeast-1:123456789012:mfmsapp-cicd-alerts

Monitoring Dashboard Template

{
  "widgets": [
    {
      "type": "metric",
      "properties": {
        "title": "Pipeline Execution Status",
        "metrics": [
          ["AWS/CodePipeline", "PipelineExecutionSuccess", "PipelineName", "mfmsapp-pipeline"],
          [".", "PipelineExecutionFailure", ".", "."]
        ],
        "region": "ap-northeast-1",
        "period": 300,
        "stat": "Sum"
      }
    },
    {
      "type": "metric",
      "properties": {
        "title": "Build Duration",
        "metrics": [
          ["AWS/CodeBuild", "Duration", "ProjectName", "mfmsapp-build"]
        ],
        "region": "ap-northeast-1",
        "period": 300,
        "stat": "Average"
      }
    },
    {
      "type": "metric",
      "properties": {
        "title": "DORA - Deployment Frequency & Failure Rate",
        "metrics": [
          ["Custom/CICD", "DeploymentFrequency", "Pipeline", "mfmsapp"],
          [".", "ChangeFailureRate", ".", "."]
        ],
        "region": "ap-northeast-1",
        "period": 86400,
        "stat": "Average"
      }
    },
    {
      "type": "log",
      "properties": {
        "title": "Recent Errors",
        "query": "SOURCE '/aws/codebuild/mfmsapp' | fields @timestamp, @message | filter @message like /ERROR|FAIL/ | sort @timestamp desc | limit 20",
        "region": "ap-northeast-1",
        "stacked": false
      }
    }
  ]
}

Save the template as dashboard.json, then create the dashboard:

aws cloudwatch put-dashboard \
  --dashboard-name mfmsapp-cicd \
  --dashboard-body file://dashboard.json

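Troubleshooting Decision Tree

| Symptom | Check first | Common causes | Quick fix |
|---|---|---|---|
| Pipeline shows a red X | Which stage failed | - | Click the stage to view its logs |
| Source stage fails | CodeCommit/S3 connection | Expired permissions, repository does not exist | Check the EventBridge rule and IAM |
| Build stage fails | CodeBuild logs | Compile errors, dependency download failures | Retry or fix the buildspec |
| Deploy stage fails | CodeDeploy logs | Health check failure, script timeout | Check appspec.yml and the scripts |
| Build times out | CodeBuild Duration | Dependency bloat, cache misses | Configure S3 caching or raise the timeout |
| 5xx spike after deployment | CloudWatch Alarm | Application bug, misconfiguration | Automatic or manual rollback |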


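Summary

CI/CD monitoring comes down to four capabilities:

  1. See: a unified log structure plus fast search with CloudWatch Logs Insights

  2. Get notified: real-time notifications through EventBridge + SNS

  3. Roll back: CodeDeploy automatic rollback plus health checks

  4. Measure: DORA metrics that track CI/CD health

Only when all four are in place is your pipeline "production ready".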



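Next up: Part 09, Advanced Techniques: Multi-Environment and Cross-Region Deployment. Parameter management, blue/green deployments, artifact replication, and parallel builds take your CI/CD from "usable" to "pleasant to use".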