云上运维的复杂程度,会随着企业上云进程的深入而呈现指数级增长。从过去“管理物理机”到如今“管理云资源”,运维工程师所面对的挑战已然截然不同。在阿里云环境下,以下几个典型痛点几乎每天都在反复上演:
首先是ECS实例管理难题。面对100台ECS,如果需要逐台登录查看运行状态、修改配置参数,单次操作往往就要耗费2到3个小时。尽管阿里云控制台比传统SSH登录更为便捷,但在批量操作场景下,逐一点击依然显得格外繁琐。
OSS存储操作同样令人头疼。日常运维中,需要定期清理过期文件、执行跨区域数据同步、配置生命周期策略。每次都要打开OSS控制台,逐层导航到具体Bucket,再手动选择文件——这种重复性劳动严重拉低了工作效率。
RDS备份管理更是分散耗时。假设你负责20个RDS实例的备份策略配置、备份完整性验证以及跨区域复制任务,每周例行检查一次,每个实例耗时约5分钟,累计就是100分钟。这还不包括临时出现的问题处理时间。
资源审计与报表生成呢?定期汇总ECS、RDS、OSS等各类资源的使用情况与费用明细,手动从控制台导出数据后再进行整理,整个过程令人十分崩溃。
权限管理更是不容忽视,RAM子账号的权限分配、安全组规则的调整,人工操作极易出现遗漏或配置错误,一旦出现问题,影响范围往往非常大。
经过长时间的一线实战,我们总结出5个核心技巧,成功将云运维效率提升了70%以上。下面逐一进行详细拆解。
1. 场景:云上运维的效率瓶颈
随着企业上云不断深入,运维工程师的挑战已从“管理物理机”全面转向“管理云资源”。在阿里云环境中,以下几类痛点表现得尤为突出:
ECS实例管理效率低下。 面对100台ECS实例,逐台登录检查状态、修改配置,每次操作需要2到3个小时。虽然通过阿里云控制台操作比SSH登录方便一些,但在批量处理场景下,反复点击依然费时费力。
OSS存储操作流程繁琐。 日常运维工作包含定期清理OSS过期文件、跨区域同步数据、设置生命周期策略。每次操作都要打开OSS控制台,导航到指定Bucket,再手动选择文件——重复劳动问题非常严重。
RDS备份管理较为分散。 假设需要负责20个RDS实例的备份策略配置、备份验证与跨区域复制。每周检查一次,每个实例5分钟,累计就要花费100分钟。
资源审计与报表生成困难。 需要定期统计ECS、RDS、OSS等资源的使用状况与费用数据,手动从控制台导出再整理,整体效率很低。
权限管理容易出错。 RAM子账号权限配置、安全组规则变更等操作,人工执行时很容易遗漏或配置不当。
这些痛点促使我们深入研究了阿里云CLI与批处理集成的解决方案。经过6个月的实践摸索,我们总结出5个核心技巧,成功将云运维效率提升了70%以上。
2. 阿里云CLI核心原理
☁️ 2.1 架构概览

2.2 阿里云CLI安装与配置
# 安装阿里云CLI
curl -fsSL https://aliyuncli.alicdn.com/install.sh | bash
# 验证安装
aliyun --version
# 配置AK和区域
aliyun configure
# 交互式配置示例:
# Access Key ID [****]: LTAI5t...
# Access Key Secret [****]: your-secret
# Default Region Id []: cn-hangzhou
# Default Output Format [json]: json
2.3 核心工具对比
| 工具 | 适用场景 | 安装方式 | 学习成本 | 推荐指数 |
|---|---|---|---|---|
| aliyun-cli | 云产品API调用 | 一键安装 | 低 | ⭐⭐⭐⭐⭐ |
| ossutil | OSS对象存储操作 | 下载即用 | 低 | ⭐⭐⭐⭐⭐ |
| terraform | 基础设施即代码 | 二进制安装 | 中 | ⭐⭐⭐⭐ |
| aliyun-sdk | 自定义开发 | pip/npm安装 | 中 | ⭐⭐⭐ |
Terraform配置示例:
# main.tf - 阿里云ECS实例自动化创建
terraform {
required_providers {
alicloud = {
source = "aliyun/alicloud"
version = "~> 1.220"
}
}
}
provider "alicloud" {
region = "cn-hangzhou"
}
# 查询可用区
data "alicloud_zones" "default" {
a vailable_resource_creation = "VSwitch"
}
# 创建VPC
resource "alicloud_vpc" "default" {
vpc_name = "terraform-vpc"
cidr_block = "172.16.0.0/12"
}
# 创建交换机
resource "alicloud_vswitch" "default" {
vpc_id = alicloud_vpc.default.id
cidr_block = "172.16.0.0/24"
zone_id = data.alicloud_zones.default.zones[0].id
}
# 创建安全组
resource "alicloud_security_group" "default" {
name = "terraform-sg"
vpc_id = alicloud_vpc.default.id
}
resource "alicloud_security_group_rule" "allow_ssh" {
type = "ingress"
ip_protocol = "tcp"
port_range = "22/22"
security_group_id = alicloud_security_group.default.id
cidr_ip = "10.0.0.0/8"
}
# 批量创建ECS实例
resource "alicloud_instance" "web" {
count = 5
instance_name = "web-${count.index + 1}"
image_id = "centos_7_9_x64_20G_alibase_20240101.vhd"
instance_type = "ecs.c6.large"
security_groups = [alicloud_security_group.default.id]
vswitch_id = alicloud_vswitch.default.id
system_disk_size = 40
system_disk_category = "cloud_essd"
tags = {
Environment = "production"
ManagedBy = "terraform"
}
}
# 输出实例信息
output "instance_ids" {
value = alicloud_instance.web[*].id
}
output "private_ips" {
value = alicloud_instance.web[*].private_ip
}
3. 实战技巧一:ECS批量管理自动化
3.1 问题分析
在日常运维中,100台ECS实例的常规管理任务——如开机、关机、重启、更换系统盘、修改安全组规则等——如果通过控制台逐个操作,不仅耗时巨大,还很容易遗漏某些实例。
3.2 解决方案
借助aliyun-cli配合Shell脚本,可以轻松实现ECS实例的批量管理:
#!/bin/bash
# ecs_batch_manager.sh - ECS实例批量管理脚本
set -euo pipefail
# ============================================================================
# 配置
# ============================================================================
REGION="cn-hangzhou"
TAG_KEY="Environment"
TAG_VALUES=("production" "staging" "testing")
OUTPUT_DIR="/tmp/ecs_reports"
# 颜色输出
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m'
mkdir -p "$OUTPUT_DIR"
log() {
local level=$1
shift
echo -e "[$(date '+%Y-%m-%d %H:%M:%S')] [$level] $*"
}
# ============================================================================
# 获取ECS实例列表
# ============================================================================
get_ecs_instances() {
local tag_value=${1:-""}
if [[ -n "$tag_value" ]]; then
# 按标签过滤
aliyun ecs DescribeInstances --region "$REGION" --Tag "1.Key=$TAG_KEY,1.Value=$tag_value" --output json 2>/dev/null
else
# 获取所有实例
aliyun ecs DescribeInstances --region "$REGION" --output json 2>/dev/null
fi
}
# ============================================================================
# 格式化实例列表
# ============================================================================
list_instances() {
local tag_filter=${1:-""}
log INFO "获取ECS实例列表 (环境: ${tag_filter:-全部})..."
local instances
instances=$(get_ecs_instances "$tag_filter" | jq -r '.Instances.Instance[] | [.InstanceId, .InstanceName, .Status, .InstanceType, .PublicIpAddress.IpAddress[0] // "-", .Tags.Tag[]? | select(.TagKey == "'"$TAG_KEY"'") | .TagValue // "-"] | @tsv')
if [[ -z "$instances" ]]; then
log WARN "未找到ECS实例"
return
fi
echo ""
echo "=========================================="
echo " ECS实例列表 (${tag_filter:-全部})"
echo "=========================================="
printf "%-20s %-25s %-12s %-20s %-15s\n" "实例ID" "名称" "状态" "规格" "公网IP"
echo "------------------------------------------"
while IFS=$'\t' read -r id name status instance_type ip; do
printf "%-20s %-25s %-12s %-20s %-15s\n" "$id" "$name" "$status" "$instance_type" "$ip"
done <<< "$instances"
local count=$(echo "$instances" | grep -c . || echo "0")
echo "------------------------------------------"
echo "总计: $count 台实例"
echo "=========================================="
}
# ============================================================================
# 批量操作ECS实例
# ============================================================================
batch_operation() {
local action=$1 # start, stop, restart
local tag_filter=$2
log INFO "开始批量操作: $action (环境: ${tag_filter:-全部})..."
# 获取实例ID列表
local instances
instances=$(get_ecs_instances "$tag_filter" | jq -r '.Instances.Instance[].InstanceId')
if [[ -z "$instances" ]]; then
log WARN "未找到ECS实例"
return 1
fi
local count=0
local success=0
local failed=0
while IFS= read -r instance_id; do
((count++))
case $action in
start)
aliyun ecs StartInstance --region "$REGION" --InstanceId "$instance_id" >/dev/null && {
((success++))
log INFO "启动成功: $instance_id"
} || {
((failed++))
log ERROR "启动失败: $instance_id"
}
;;
stop)
aliyun ecs StopInstance --region "$REGION" --InstanceId "$instance_id" --StoppedMode StopCharging >/dev/null && {
((success++))
log INFO "停止成功(释放资源费): $instance_id"
} || {
((failed++))
log ERROR "停止失败: $instance_id"
}
;;
restart)
aliyun ecs RebootInstance --region "$REGION" --InstanceId "$instance_id" >/dev/null && {
((success++))
log INFO "重启成功: $instance_id"
} || {
((failed++))
log ERROR "重启失败: $instance_id"
}
;;
esac
done <<< "$instances"
echo ""
echo "=========================================="
echo "批量操作结果汇总"
echo "=========================================="
echo "操作: $action"
echo "总数: $count"
echo -e "成功: ${GREEN}$success${NC}"
echo -e "失败: ${RED}$failed${NC}"
echo "=========================================="
}
# ============================================================================
# 实例健康巡检
# ============================================================================
health_check() {
local tag_filter=${1:-""}
log INFO "开始ECS实例健康巡检..."
local instances
instances=$(get_ecs_instances "$tag_filter" | jq -r '.Instances.Instance[] | [.InstanceId, .InstanceName, .Status, .Cpu, .Memory, .OSType, .CreationTime] | @tsv')
local report_file="${OUTPUT_DIR}/ecs_health_$(date +%Y%m%d_%H%M%S).csv"
echo "实例ID,名称,状态,CPU(核),内存(GB),OS类型,创建时间" > "$report_file"
echo ""
echo "=========================================="
echo " ECS实例健康巡检报告"
echo "=========================================="
local warning_count=0
while IFS=$'\t' read -r id name status cpu memory os_type create_time; do
local status_icon
local cpu_icon
local mem_icon
case $status in
Running) status_icon="✅" ;;
Stopped) status_icon="⏹️" ;;
*) status_icon="❓" ;;
esac
# CPU核数检查
if [[ $cpu -lt 4 ]]; then
cpu_icon="⚠️"
((warning_count++))
else
cpu_icon="✓"
fi
printf "${status_icon} %-25s %-20s %-10s %-5s核 %-5sGB\n" "$id" "$name" "$status" "$cpu" "$memory"
echo "$id,$name,$status,$cpu,$memory,$os_type,$create_time" >> "$report_file"
done <<< "$instances"
local total=$(echo "$instances" | grep -c . || echo "0")
echo "------------------------------------------"
echo -e "总计: $total 台实例"
echo -e "警告: ${YELLOW}$warning_count${NC} 台配置过低"
echo -e "报告已保存: ${GREEN}$report_file${NC}"
echo "=========================================="
}
# ============================================================================
# 安全组检查
# ============================================================================
security_check() {
local instance_id=$1
log INFO "检查实例安全组: $instance_id"
# 获取实例关联的安全组
local security_groups
security_groups=$(aliyun ecs DescribeInstanceAttribute --region "$REGION" --InstanceId "$instance_id" 2>/dev/null | jq -r '.SecurityGroupIds.SecurityGroupId[]')
echo ""
echo "=========================================="
echo " 安全组规则检查"
echo "=========================================="
echo "实例ID: $instance_id"
echo ""
while IFS= read -r sg_id; do
echo "安全组: $sg_id"
echo "------------------------------"
# 获取安全组规则
aliyun ecs DescribeSecurityGroupAttribute --region "$REGION" --SecurityGroupId "$sg_id" 2>/dev/null | jq -r '.Permissions.Permission[] | select(.Direction == "ingress" and .Policy == "Accept") | [.IpProtocol, .PortRange, .SourceCidrIp, .Description // "-"] | @tsv' | while IFS=$'\t' read -r proto port source desc; do
# 高亮危险规则
if [[ "$source" == "0.0.0.0/0" && "$port" == "22/22" ]]; then
echo -e "${RED}⚠️ 危险:${NC} SSH端口全网开放"
elif [[ "$source" == "0.0.0.0/0" && "$port" == "3389/3389" ]]; then
echo -e "${RED}⚠️ 危险:${NC} RDP端口全网开放"
fi
printf "%-6s %-15s %-15s %s\n" "$proto" "$port" "$source" "$desc"
done
echo ""
done <<< "$security_groups"
echo "=========================================="
}
# ============================================================================
# 费用统计
# ============================================================================
cost_report() {
local month=${1:-$(date +%Y%m)}
log INFO "获取ECS费用统计 (月份: $month)..."
# 使用阿里云Billing API获取费用
local cost_data
cost_data=$(aliyun bssopenapi QueryInstanceBill --BillingCycle "$month" --ProductCode "ecs" --output json 2>/dev/null)
local total_cost
total_cost=$(echo "$cost_data" | jq -r '.Data.TotalOutstandingAmount // "0"')
local instance_count
instance_count=$(echo "$cost_data" | jq -r '.Data.TotalCount // 0')
echo ""
echo "=========================================="
echo " ECS费用统计 ($month)"
echo "=========================================="
echo "运行实例数: $instance_count"
echo "本月费用: ¥$total_cost"
echo ""
# 按规格统计
echo "按规格统计:"
echo "------------------------------"
echo "$cost_data" | jq -r '.Data.Items.Item[]? | [.InstanceConfig, .PretaxGrossAmount // "0"] | @tsv' 2>/dev/null | awk -F'\t' '{config[$1]++; cost[$1] += $2} END {for (c in config) {printf "%-30s =台 ¥%8.2f\n", c, config[c], cost[c]}}'
echo "=========================================="
}
# ============================================================================
# 主函数
# ============================================================================
main() {
local action=${1:-"help"}
shift || true
case $action in
list)
list_instances "${1:-}"
;;
start|stop|restart)
batch_operation "$action" "${1:-}"
;;
health)
health_check "${1:-}"
;;
security)
security_check "$1"
;;
cost)
cost_report "${1:-}"
;;
help|*)
echo "用法: $0 [options]"
echo ""
echo "Actions:"
echo " list [tag] - 列出ECS实例"
echo " start [tag] - 启动ECS实例"
echo " stop [tag] - 停止ECS实例(释放资源费)"
echo " restart [tag] - 重启ECS实例"
echo " health [tag] - 健康巡检"
echo " security - 安全组检查"
echo " cost [month] - 费用统计"
echo ""
echo "示例:"
echo " $0 list production"
echo " $0 health"
echo " $0 cost 202604"
;;
esac
}
main "$@"
3.3 踩坑案例
坑1:aliyun-cli凭证泄露
场景:将AK/SK直接硬编码在脚本中,提交到Git仓库后导致凭证泄露。
解决:推荐使用阿里云RAM角色配合OIDC免密方案:
# 错误做法:直接写在脚本中
export ALIBABA_CLOUD_ACCESS_KEY_ID="LTAI5t..."
export ALIBABA_CLOUD_ACCESS_KEY_SECRET="your-secret"
# 正确做法1:使用配置文件
aliyun configure --profile prod-profile
aliyun ecs DescribeInstances --profile prod-profile
# 正确做法2:使用RAM角色(推荐)
# 在ECS实例上绑定RAM角色
aliyun ecs AttachInstanceRamRole --RegionId cn-hangzhou --RamRoleName EcsAutomationRole --InstanceIds '["i-xxxxx"]'
# 之后无需配置AK/SK
aliyun ecs DescribeInstances
# 正确做法3:使用OIDC(CI/CD场景)
# 在GitHub Actions中配置
# export ALIBABA_CLOUD_ROLE_ARN=acs:ram::123456789:role/oidc-role
# export ALIBABA_CLOUD_OIDC_PROVIDER_ARN=acs:ram::123456789:oidc-provider/github-actions
坑2:批量操作触发限流
场景:同时对100台ECS执行操作,触发了阿里云API的限流机制。
解决:实现指数退避重试机制与并发控制:
#!/bin/bash
# 带退避的重试函数
call_with_retry() {
local max_retries=${1:-5}
local retry_count=0
local wait_time=1
shift
local cmd=("$@")
while [[ $retry_count -lt $max_retries ]]; do
if output=$("${cmd[@]}" 2>&1); then
echo "$output"
return 0
fi
((retry_count++))
wait_time=$((wait_time * 2))
log WARN "调用失败,${wait_time}秒后重试 ($retry_count/$max_retries)..."
sleep $((RANDOM % wait_time + 1)) # 随机延迟,避免雪崩
done
log ERROR "重试次数耗尽: ${cmd[*]}"
return 1
}
# 并发控制(每次最多10个请求)
batch_with_concurrency() {
local actions=("$@")
local max_concurrent=10
local running=0
for action in "${actions[@]}"; do
# 在后台执行
call_with_retry 3 $action &
((running++))
# 控制并发数
if [[ $running -ge $max_concurrent ]]; then
wait -n
((running--))
fi
done
# 等待剩余任务完成
wait
}
4. 实战技巧二:OSS存储批处理
4.1 问题分析
OSS对象存储的日常操作——包括文件上传、下载、删除以及生命周期管理——如果依赖控制台逐个处理,效率非常低下。
4.2 解决方案
利用ossutil工具可以高效地批量管理OSS存储:
#!/bin/bash
# oss_batch_manager.sh - OSS批量管理脚本
set -euo pipefail
# ============================================================================
# 配置
# ============================================================================
OSS_ENDPOINT="oss-cn-hangzhou.aliyuncs.com"
LOG_DIR="/var/log/oss_operations"
mkdir -p "$LOG_DIR"
log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a "${LOG_DIR}/oss_batch.log"; }
# ============================================================================
# 批量上传
# ============================================================================
batch_upload() {
local source_dir=$1
local target_bucket=$2
local target_prefix=${3:-""}
local parallel=${4:-10}
log "开始批量上传: $source_dir -> oss://$target_bucket/$target_prefix"
ossutil cp "$source_dir" "oss://$target_bucket/$target_prefix" --recursive --parallel "$parallel" --update --include "*.log" --include "*.txt" --exclude "*.tmp" --jobs 5 --output-dir "$LOG_DIR"
log "上传完成"
}
# ============================================================================
# 批量下载
# ============================================================================
batch_download() {
local source_bucket=$1
local source_prefix=$2
local target_dir=$3
local days=${4:-7}
log "开始批量下载: oss://$source_bucket/$source_prefix -> $target_dir"
log "过滤条件: ${days}天内修改的文件"
ossutil cp "oss://$source_bucket/$source_prefix" "$target_dir" --recursive --update --include "*" --exclude "*.tmp" --exclude "*.bak" --output-dir "$LOG_DIR"
log "下载完成"
}
# ============================================================================
# 过期文件清理
# ============================================================================
cleanup_expired() {
local bucket=$1
local prefix=$2
local days=${3:-30}
log "清理过期文件: oss://$bucket/$prefix (${days}天前)"
# 列出并删除过期文件
ossutil rm "oss://$bucket/$prefix" --recursive --all-versions --update --output-dir "$LOG_DIR" | while read -r line; do
log "删除: $line"
done
# 设置生命周期策略
log "设置生命周期策略..."
cat > /tmp/lifecycle_rule.json << EOF
{
"rules": [
{
"id": "auto-cleanup",
"prefix": "$prefix",
"status": "Enabled",
"expiration": {
"days": $days
}
}
]
}
EOF
ossutil lifecycle --method put "oss://$bucket" /tmp/lifecycle_rule.json
log "清理完成"
}
# ============================================================================
# 跨区域同步
# ============================================================================
cross_region_sync() {
local source_bucket=$1
local target_bucket=$2
local prefix=${3:-""}
log "开始跨区域同步: $source_bucket -> $target_bucket"
ossutil cp "oss://$source_bucket/$prefix" "oss://$target_bucket/$prefix" --recursive --update --copy-props metadata --output-dir "$LOG_DIR"
log "同步完成"
}
# ============================================================================
# Bucket容量统计
# ============================================================================
bucket_stats() {
local bucket=$1
log "统计Bucket容量: $bucket"
echo ""
echo "=========================================="
echo " Bucket容量统计: $bucket"
echo "=========================================="
# 使用ossutil du命令
ossutil du "oss://$bucket" --output-dir "$LOG_DIR"
# 使用阿里云API获取详细信息
aliyun oss GetBucketStat --bucket "$bucket" --output json 2>/dev/null | jq -r '"文件数量: (.ObjectCount // 0) 个","存储容量: (.Storage | ./ (1024*1024*1024) | .*100 | round/100) GB","文件碎片: (.MultipartUploadCount // 0) 个","碎片容量: (.MultipartUploadStorage | ./ (1024*1024*1024) | .*100 | round/100) GB"' 2>/dev/null
echo "=========================================="
}
# ============================================================================
# 日志分析
# ============================================================================
analyze_logs() {
local bucket=$1
local log_prefix=${2:-"logs/"}
local output=$3
log "分析OSS访问日志: $bucket/$log_prefix"
# 下载日志
local tmp_dir=$(mktemp -d)
trap "rm -rf $tmp_dir" EXIT
ossutil cp "oss://$bucket/$log_prefix" "$tmp_dir/" --recursive --include "*.csv" --output-dir "$LOG_DIR"
# 分析日志
echo ""
echo "=========================================="
echo " 访问日志分析报告"
echo "=========================================="
echo ""
# 统计请求来源
echo "请求来源TOP 10:"
cat "$tmp_dir"/*.csv 2>/dev/null | awk -F',' '{print $1}' | sort | uniq -c | sort -rn | head -10
echo ""
echo "请求方法分布:"
cat "$tmp_dir"/*.csv 2>/dev/null | awk -F',' '{print $2}' | sort | uniq -c | sort -rn
echo ""
echo "HTTP状态码分布:"
cat "$tmp_dir"/*.csv 2>/dev/null | awk -F',' '{print $3}' | sort | uniq -c | sort -rn
echo ""
echo "=========================================="
echo "分析完成"
}
# ============================================================================
# 主函数
# ============================================================================
main() {
local action=${1:-"help"}
shift || true
case $action in
upload)
batch_upload "$@"
;;
download)
batch_download "$@"
;;
cleanup)
cleanup_expired "$@"
;;
sync)
cross_region_sync "$@"
;;
stats)
bucket_stats "$1"
;;
analyze)
analyze_logs "$@"
;;
help|*)
echo "用法: $0 [options]"
echo ""
echo "Actions:"
echo " upload [prefix] [parallel] - 批量上传"
echo " download [days] - 批量下载"
echo " cleanup [prefix] [days] - 清理过期文件"
echo " sync [prefix] - 跨区域同步"
echo " stats - 容量统计"
echo " analyze [prefix] [output] - 日志分析"
;;
esac
}
main "$@"
4.3 踩坑案例
坑1:OSS批量下载导致文件损坏
场景:批量下载大量小文件时,偶尔出现文件内容损坏的情况。
解决:使用ossutil内置的MD5校验功能来确保文件完整性:
# 开启MD5校验
ossutil cp "oss://bucket/prefix/" /local/dir/ --recursive --check-md5 --update
# 批量验证文件完整性
verify_oss_files() {
local bucket=$1
local prefix=$2
local local_dir=$3
log "验证文件完整性..."
find "$local_dir" -type f | while read -r file; do
local relative_path="${file#$local_dir/}"
local local_md5=$(md5sum "$file" | cut -d' ' -f1)
# 获取OSS上的ETag(即MD5)
local oss_md5
oss_md5=$(ossutil stat "oss://$bucket/$prefix/$relative_path" | grep "ETag:" | awk '{print $2}' | tr -d '"')
if [[ "$local_md5" != "$oss_md5" ]]; then
log ERROR "文件不匹配: $relative_path (本地: $local_md5, OSS: $oss_md5)"
else
log INFO "文件正常: $relative_path"
fi
done
}
5. 实战技巧三:RDS数据库运维自动化
5.1 问题分析
管理20个RDS实例,涉及备份策略配置、运行监控、参数调优等一系列操作,单纯依赖手工管理效率非常低下。
5.2 解决方案
通过aliyun-cli可以实现RDS运维的全面自动化:
#!/bin/bash
# rds_ops_manager.sh - RDS运维管理脚本
set -euo pipefail
REGION="cn-hangzhou"
LOG_DIR="/var/log/rds_ops"
mkdir -p "$LOG_DIR"
log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a "${LOG_DIR}/rds_ops.log"; }
# ============================================================================
# 获取RDS实例列表
# ============================================================================
get_rds_instances() {
aliyun rds DescribeDBInstances --region "$REGION" --output json 2>/dev/null | jq -r '.Items.DBInstance[] | [.DBInstanceId, .DBInstanceDescription, .DBInstanceStatus, .DBInstanceClass, .Engine, .EngineVersion] | @tsv'
}
# ============================================================================
# RDS健康巡检
# ============================================================================
rds_health_check() {
log "开始RDS实例健康巡检..."
local instances
instances=$(get_rds_instances)
local report_file="${LOG_DIR}/rds_health_$(date +%Y%m%d_%H%M%S).csv"
echo "实例ID,名称,状态,规格,引擎,版本,连接数,CPU,内存,磁盘" > "$report_file"
echo ""
echo "=========================================="
echo " RDS实例健康巡检"
echo "=========================================="
while IFS=$'\t' read -r id name status class engine version; do
# 获取实例资源使用
local resource_info
resource_info=$(aliyun rds DescribeResourceUsage --region "$REGION" --DBInstanceId "$id" 2>/dev/null)
local connections
connections=$(aliyun rds DescribeDBInstancePerformance --region "$REGION" --DBInstanceId "$id" --Key "MySQL_NetworkTraffic" --StartTime "$(date -d '-1 hour' +%Y-%m-%dT%H:%MZ)" --EndTime "$(date +%Y-%m-%dT%H:%MZ)" 2>/dev/null | jq -r '.PerformanceKeys.PerformanceKey[0].Values.PerformanceValue[0].Value // "0"')
printf "%-30s %-25s %-10s %-15s\n" "$id" "$name" "$status" "$class"
echo "$id,$name,$status,$class,$engine,$version,$connections" >> "$report_file"
done <<< "$instances"
echo "------------------------------------------"
echo "巡检报告已保存: $report_file"
echo "=========================================="
}
# ============================================================================
# 自动备份配置
# ============================================================================
configure_backup() {
local instance_id=$1
local backup_time=${2:-"03:00Z"} # UTC时间,即北京时间11:00
local retention_days=${3:-7}
log "配置备份策略: $instance_id"
# 设置备份策略
aliyun rds ModifyBackupPolicy --region "$REGION" --DBInstanceId "$instance_id" --PreferredBackupPeriod "Monday,Tuesday,Wednesday,Thursday,Friday,Saturday,Sunday" --PreferredBackupTime "$backup_time" --BackupRetentionPeriod "$retention_days" --EnableBackupLog "1" --LogBackupRetentionPeriod "$retention_days" --output json 2>/dev/null
log "备份策略配置完成"
# 显示当前配置
aliyun rds DescribeBackupPolicy --region "$REGION" --DBInstanceId "$instance_id" | jq '{backup_time: .PreferredBackupTime, retention_days: .BackupRetentionPeriod, log_backup: .EnableBackupLog, log_retention: .LogBackupRetentionPeriod}'
}
# ============================================================================
# 慢查询分析
# ============================================================================
slow_query_analysis() {
local instance_id=$1
local hours=${2:-24}
log "分析慢查询: $instance_id (最近${hours}小时)"
# 获取慢查询报表
aliyun rds DescribeSlowLogRecords --region "$REGION" --DBInstanceId "$instance_id" --StartTime "$(date -d "-$hours hours" +%Y-%m-%dT%H:%MZ)" --EndTime "$(date +%Y-%m-%dT%H:%MZ)" --PageSize 100 --output json 2>/dev/null | jq -r '.Items.SQLSlowRecord[]? | [.HostAddress, .SQLText[:100], .QueryTimes, .LockTimes, .ParseRowCounts, .ReturnRowCounts, .ExecutionStartTime] | @tsv' | while IFS=$'\t' read -r host sql query_time lock_time parse_rows return_rows exec_time; do
# 标记慢查询级别
if [[ $query_time -gt 10 ]]; then
echo -e "