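0x00 Background
My APC BK650M2 arrived, but neither apcupsd nor APC's official Windows software could detect the UPS. I don't know whether that is down to my motherboard or my configuration. Fortunately NUT works, so here is my setup. Others online have also reported that apcupsd does not work with the BK650M2 (link), so going straight to NUT is the better choice.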
Goals:
1. After a power failure, TrueNAS shuts down first, then PVE.
2. When power returns, the machine restarts (ideally after a delay).
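Problem: the original plan was to use upsmon -c fsd. Testing showed that the primary node also receives the FSD message and goes straight into its own shutdown sequence. That is not what I expected (if power has not returned after X seconds, notify all VMs, and only then shut down the primary node), though my configuration may be at fault. The current scheme: the VM shuts down after 60 seconds on battery and the primary after 90 seconds, which comes close to the intended behavior. Running qm shutdown <vmid> directly from the primary's shutdown handling would actually be simpler, but I wanted to try the fancier-looking approach.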
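The simpler alternative mentioned above, calling qm shutdown from the primary, could look roughly like this. It is only a sketch: it assumes the usual qm list column layout (VMID, NAME, STATUS, ...), and the function names and the QM override variable are hypothetical helpers, not part of PVE or NUT.

```shell
#!/bin/sh
# Hypothetical helper: stop every running guest before the host goes down.
# QM can be overridden (e.g. with a stub) for a dry run; it defaults to PVE's qm.
QM="${QM:-qm}"

# Print the VMID of every guest whose STATUS column reads "running".
# Assumes qm list prints one header row followed by: VMID NAME STATUS ...
running_vmids() {
    "$QM" list | awk 'NR > 1 && $3 == "running" { print $1 }'
}

# Ask each running guest to shut down cleanly.
# $1 is the per-guest timeout in seconds (defaults to 60).
shutdown_guests() {
    for vmid in $(running_vmids); do
        "$QM" shutdown "$vmid" --timeout "${1:-60}"
    done
}
```

Calling shutdown_guests 60 from the power-off branch of the CMDSCRIPT, before upsmon -c fsd, would stop the guests first and only then start the host's shutdown.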
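Most of this post follows the official Network UPS Tools documentation.
My setup
NAS: H11SSL-i + EPYC 7282
UPS: one APC BK650M2-CH (with several UPSes the configuration may get a bit more complicated)
System: PVE + TrueNAS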
Installation
    apt install nut-server nut-client
nut-scanner -U scans for connected UPS devices:
    root@pve00:~# nut-scanner -U
    SNMP library not found. SNMP search disabled.
    Neon library not found. XML search disabled.
    IPMI library not found. IPMI search disabled.
    Scanning USB bus.
    No start IP, skipping NUT bus (old connect method)
    [nutdev1]
        driver = "usbhid-ups"
        port = "auto"
        vendorid = "051D"
        productid = "0002"
        product = "Back-UPS BK650M2-CH FW:294803G -292804G"
        serial = "9xxxxxxxxxxx"
        vendor = "American Power Conversion"
        bus = "001"
Configuration
UPS driver configuration: ups.conf
    root@pve00:~# cat /etc/nut/ups.conf
    [APC]
        driver = usbhid-ups
        port = auto
        desc = "APC BK650M2-CH"
Once this is in place, run upsdrvctl start to bring the UPS driver up. After that we can look at the current state of the UPS:
    root@pve00:~# upsdrvctl start
    Network UPS Tools - UPS driver controller 2.7.4
    Network UPS Tools - Generic HID driver 0.41 (2.7.4)
    USB communication driver 0.33
    Using subdriver: APC HID 0.96

    root@pve00:~# upsc APC@localhost:3493
    Init SSL without certificate database
    battery.charge: 100
    battery.charge.low: 10
    battery.mfr.date: 2001/01/01
    battery.runtime: 2403
    battery.runtime.low: 120
    battery.type: PbAc
    battery.voltage: 13.6
    battery.voltage.nominal: 12.0
    device.mfr: American Power Conversion
    device.model: Back-UPS BK650M2-CH
    device.serial: 9xxxxxxxxxx
    device.type: ups
    driver.name: usbhid-ups
    driver.parameter.pollfreq: 30
    driver.parameter.pollinterval: 2
    driver.parameter.port: auto
    driver.parameter.synchronous: no
    driver.version: 2.7.4
    driver.version.data: APC HID 0.96
    driver.version.internal: 0.41
    input.sensitivity: low
    input.transfer.high: 278
    input.transfer.low: 160
    input.voltage: 226.0
    input.voltage.nominal: 220
    ups.beeper.status: disabled
    ups.delay.shutdown: 20
    ups.firmware: 294803G -292804G
    ups.load: 11
    ups.mfr: American Power Conversion
    ups.mfr.date: 2022/05/06
    ups.model: Back-UPS BK650M2-CH
    ups.productid: 0002
    ups.realpower.nominal: 390
    ups.serial: 9xxxxxxxxxx
    ups.status: OL
    ups.test.result: No test initiated
    ups.timer.reboot: 0
    ups.timer.shutdown: -1
    ups.vendorid: 051d
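The field that matters most for the shutdown logic is ups.status (OL in the output above). As a rough aid for reading it, here is a small sketch that spells out a few common status flags. The describe_status function is a hypothetical helper, and it only covers the flags relevant to this post.

```shell
#!/bin/sh
# Hypothetical helper: translate the space-separated flags of ups.status
# into plain words. Only a handful of flags are covered here.
describe_status() {
    for flag in $1; do
        case "$flag" in
            OL)  echo "on line power" ;;
            OB)  echo "on battery" ;;
            LB)  echo "low battery" ;;
            FSD) echo "forced shutdown" ;;
            *)   echo "other flag: $flag" ;;
        esac
    done
}
```

Passing a variable name to upsc prints only that value, so the helper can be fed directly: describe_status "$(upsc APC@localhost ups.status)".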
Server configuration
Configure two users: an administrator, and one for the VM (the remote upsmon).
    root@pve00:~# cat /etc/nut/upsd.conf
    LISTEN 0.0.0.0 3493

    root@pve00:~# cat /etc/nut/upsd.users
    [admin]
        password = admin
        actions = SET FSD
        instcmds = ALL
        upsmon master

    [user]
        password = normal
        upsmon slave
Permissions recommended by the official documentation:
    chown root:nut upsd.conf upsd.users
    chmod 0640 upsd.conf upsd.users
You can run upsd by hand as a test. In practice you can simply start it with systemctl restart nut-server.
    root@pve00:/etc/nut# upsd
    Network UPS Tools upsd 2.7.4
    fopen /run/nut/upsd.pid: No such file or directory
    listening on 0.0.0.0 port 3493
    Connected to UPS [APC]: usbhid-ups-APC
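Shutdown sequence
First, a look at the sequence a shutdown has to follow. upsmon secondary systems see "FSD" and:
1. generate a NOTIFY_SHUTDOWN event
2. wait FINALDELAY seconds (typically 5)
3. call SHUTDOWNCMD
4. disconnect from upsd
The upsmon on the server then waits up to HOSTSYNC seconds. If secondaries are still connected after that time, it ignores them and carries on with its own shutdown.
The upsmon primary:
1. generates a NOTIFY_SHUTDOWN event
2. waits FINALDELAY seconds (typically 5)
3. creates the POWERDOWNFLAG file in its local filesystem
4. calls the SHUTDOWNCMD
Conclusion: the total time the VMs need to shut down must be less than the primary's HOSTSYNC + FINALDELAY.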
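To make that conclusion concrete, here is a small arithmetic check. It is only a sketch: fits_budget is a hypothetical helper, and the guest shutdown time passed to it is something you have to measure yourself.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the guests finish shutting down
# before the primary stops waiting for them.
# $1 = measured guest shutdown time (s), $2 = HOSTSYNC (s), $3 = FINALDELAY (s)
fits_budget() {
    [ "$1" -lt $(( $2 + $3 )) ]
}

# Example: with HOSTSYNC 30 and FINALDELAY 10 the budget is 40 seconds,
# so a guest that needs 35 seconds fits and one that needs 45 does not.
if fits_budget 35 30 10; then
    echo "35s guest shutdown fits"
fi
if ! fits_budget 45 30 10; then
    echo "45s guest shutdown does not fit"
fi
```

If the check fails, either raise HOSTSYNC on the primary or make the guests shut down faster.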
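Configuring automatic shutdown (upsmon)
https://networkupstools.org/docs/man/upsmon.conf.html
First, the configuration on PVE: wait 30 seconds for the VMs to shut down, then wait another 10 seconds before shutting down the host.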
    root@pve00:~# cat /etc/nut/upsmon.conf |grep -v "^#"

    MONITOR APC@localhost:3493 1 admin admin primary

    FINALDELAY 10
    HOSTSYNC 30

    NOTIFYFLAG ONLINE SYSLOG+WALL+EXEC
    NOTIFYFLAG ONBATT SYSLOG+WALL+EXEC
    NOTIFYFLAG LOWBATT SYSLOG+WALL+EXEC
    NOTIFYFLAG FSD SYSLOG+WALL+EXEC
    NOTIFYCMD /usr/sbin/upssched

    # NOTIFYFLAG COMMOK SYSLOG+WALL
    # NOTIFYFLAG COMMBAD SYSLOG+WALL
    # NOTIFYFLAG SHUTDOWN SYSLOG+WALL
    # NOTIFYFLAG REPLBATT SYSLOG+WALL
    # NOTIFYFLAG NOCOMM SYSLOG+WALL
    # NOTIFYFLAG NOPARENT SYSLOG+WALL

    # FINALDELAY 50
    MINSUPPLIES 1
    SHUTDOWNCMD "/sbin/shutdown -h +0"
    POLLFREQ 5
    POLLFREQALERT 5
    DEADTIME 15
    POWERDOWNFLAG /etc/killpower
    RBWARNTIME 43200
    NOCOMMWARNTIME 300
upsmon -> calls upssched -> calls your CMDSCRIPT
    root@pve00:/etc/nut# cat upssched.conf |grep -v "^#"
    CMDSCRIPT /bin/upssched-cmd
    PIPEFN /run/nut/upssched/upssched.pipe
    LOCKFN /run/nut/upssched/upssched.lock

    AT ONBATT * START-TIMER power-off 90
    AT LOWBATT * START-TIMER power-off 10

    AT ONLINE * CANCEL-TIMER power-off
    root@pve00:/etc/nut# cat /bin/upssched-cmd |grep -v "^#"
    #!/bin/sh
    case $1 in
        power-off)
            logger -t upssched-cmd "Power OFF right now"
            /usr/sbin/upsmon -c fsd
            ;;
        *)
            logger -t upssched-cmd "Unrecognized command: $1"
            ;;
    esac
systemctl enable nut-monitor
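Next, the configuration on TrueNAS
In the web UI: fill in the UPS and user settings from the primary node, select SLAVE mode, and set it to shut down 60 seconds after switching to battery.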
Testing
Running upsmon -c fsd by hand: both PVE and TrueNAS shut down.
Cutting the power: behaves as intended.
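Summary
After all the debugging, it turns out that a complicated configuration is not really needed; the services that ship with the system are mostly enough. The final behavior:
1. Mains power is lost and ONBATT is received.
2. TrueNAS shuts down after 60 seconds.
3. After 90 seconds PVE runs upsmon -c fsd and goes through the normal shutdown sequence.
4. The host powers off.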
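This way only the NAS itself is shut down. Because the motherboard has IPMI, it still draws some power, so later I will look into how to cut the UPS output directly.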
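For cutting the UPS output, the NUT documentation describes checking the POWERDOWNFLAG late in the shutdown and then telling the driver to power the UPS off. The following is only a sketch of that idea and has not been tested on the BK650M2. The function name and the UPSMON / UPSDRVCTL override variables are hypothetical, and the Debian NUT packages may already ship a similar hook, so check before adding your own.

```shell
#!/bin/sh
# Hypothetical late-shutdown hook. UPSMON and UPSDRVCTL can be overridden
# with stubs for a dry run; by default the real NUT binaries are used.
UPSMON="${UPSMON:-upsmon}"
UPSDRVCTL="${UPSDRVCTL:-upsdrvctl}"

cut_ups_power_if_flagged() {
    # upsmon -K exits successfully only if the POWERDOWNFLAG file exists
    # and was written by upsmon, i.e. this shutdown was caused by the UPS.
    if "$UPSMON" -K; then
        # Ask the driver to run the UPS shutdown sequence (cut the output).
        "$UPSDRVCTL" shutdown
    else
        echo "no powerdown flag; leaving UPS output on"
    fi
}
```

The function is meant to be called at the very end of the host's shutdown, once the filesystems are no longer being written to. Many UPS models restore their output on their own when mains power returns, which would also cover the restart goal, but whether this model does so is an assumption to verify.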
Also, there is no need to write a script that stops the VMs: pve-manager.service already has a stop task configured: ExecStop=/usr/bin/pvesh --nooutput create /nodes/localhost/stopall
TODO
Low-battery threshold
No restart within N seconds of a power outage
Email notifications
References