0x00 Background

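My APC BK650M2 arrived. Whether it is a motherboard compatibility issue or something in my configuration, neither apcupsd nor the official Windows software could detect the UPS. Fortunately NUT works, so here is my configuration. Also, someone online reported that apcupsd does not work with the BK650M2 (link), so going straight to NUT seems the better choice.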

Goals: 1. On power loss, TrueNAS shuts down first, then PVE shuts down; 2. Restart when power returns (ideally after a delay)

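Problem: the original plan was to use upsmon -c fsd. Testing showed that the primary node also receives the FSD message and enters its shutdown sequence right away, which differs from what I expected (if power has not returned after X seconds, notify all VMs, and only then shut down the primary). My configuration may also be at fault. The current approach: the VM shuts down after 60 seconds on battery and the primary shuts down after 90 seconds, which is close to the intended effect. Running qm shutdown <vmid> as part of the primary's shutdown handling would actually be simpler, but I wanted to try the fancier-looking approach anyway.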
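For comparison, the simpler alternative mentioned above (calling qm shutdown from the primary) might look roughly like the following. This is only a sketch: the helper name shutdown_vms, the QM override variable, and the VM IDs 100 and 101 are made up for illustration, and the demo runs in dry-run mode so nothing is actually stopped.

```shell
#!/bin/sh
# shutdown_vms TIMEOUT VMID...: ask each VM to shut down cleanly, in order.
# QM defaults to the real Proxmox `qm`; set QM=echo for a dry run.
# (Hypothetical helper, not part of Proxmox or NUT.)
QM=${QM:-qm}

shutdown_vms() {
    timeout=$1
    shift
    for vmid in "$@"; do
        "$QM" shutdown "$vmid" --timeout "$timeout" \
            || echo "VM $vmid did not stop cleanly" >&2
    done
}

# Dry run with made-up VM IDs: print the commands instead of running them.
plan=$(QM=echo; shutdown_vms 60 100 101)
echo "$plan"
```

On a real host the function would be called without the QM=echo override, for example from the script that upssched runs.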

Most of this article follows the official Network UPS Tools documentation.

My setup

  1. NAS: H11SSL-i + EPYC 7282
  2. One APC BK650M2-CH; a setup with several UPSes may be slightly more complicated
  3. System environment: PVE + TrueNAS

Installation

apt install nut-server nut-client

nut-scanner -U scans for UPSes connected over USB

root@pve00:~# nut-scanner -U
SNMP library not found. SNMP search disabled.
Neon library not found. XML search disabled.
IPMI library not found. IPMI search disabled.
Scanning USB bus.
No start IP, skipping NUT bus (old connect method)
[nutdev1]
driver = "usbhid-ups"
port = "auto"
vendorid = "051D"
productid = "0002"
product = "Back-UPS BK650M2-CH FW:294803G -292804G"
serial = "9xxxxxxxxxxx"
vendor = "American Power Conversion"
bus = "001"

Configuration

UPS driver configuration: ups.conf

root@pve00:~# cat /etc/nut/ups.conf
[APC]
driver = usbhid-ups
port = auto
desc = "APC BK650M2-CH"

Once this is configured, running upsdrvctl start brings the UPS driver up, and the current UPS status can then be queried directly.

root@pve00:~# upsdrvctl start
Network UPS Tools - UPS driver controller 2.7.4
Network UPS Tools - Generic HID driver 0.41 (2.7.4)
USB communication driver 0.33
Using subdriver: APC HID 0.96

root@pve00:~# upsc APC@localhost:3493
Init SSL without certificate database
battery.charge: 100
battery.charge.low: 10
battery.mfr.date: 2001/01/01
battery.runtime: 2403
battery.runtime.low: 120
battery.type: PbAc
battery.voltage: 13.6
battery.voltage.nominal: 12.0
device.mfr: American Power Conversion
device.model: Back-UPS BK650M2-CH
device.serial: 9xxxxxxxxxx
device.type: ups
driver.name: usbhid-ups
driver.parameter.pollfreq: 30
driver.parameter.pollinterval: 2
driver.parameter.port: auto
driver.parameter.synchronous: no
driver.version: 2.7.4
driver.version.data: APC HID 0.96
driver.version.internal: 0.41
input.sensitivity: low
input.transfer.high: 278
input.transfer.low: 160
input.voltage: 226.0
input.voltage.nominal: 220
ups.beeper.status: disabled
ups.delay.shutdown: 20
ups.firmware: 294803G -292804G
ups.load: 11
ups.mfr: American Power Conversion
ups.mfr.date: 2022/05/06
ups.model: Back-UPS BK650M2-CH
ups.productid: 0002
ups.realpower.nominal: 390
ups.serial: 9xxxxxxxxxx
ups.status: OL
ups.test.result: No test initiated
ups.timer.reboot: 0
ups.timer.shutdown: -1
ups.vendorid: 051d
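
When scripting against this output, single values can be pulled out by key. A minimal sketch, assuming the `key: value` layout shown above; ups_var is a hypothetical helper name, not something NUT provides, and the sample text is copied from the output above so the snippet runs without a UPS attached.

```shell
#!/bin/sh
# ups_var KEY: read `upsc` output on stdin and print the value for KEY.
# Hypothetical helper for illustration; it is not part of NUT.
ups_var() {
    awk -v key="$1" -F': ' '$1 == key { print $2; exit }'
}

# Sample lines copied from the upsc output above.
sample='battery.charge: 100
battery.runtime: 2403
ups.status: OL'

charge=$(printf '%s\n' "$sample" | ups_var battery.charge)
status=$(printf '%s\n' "$sample" | ups_var ups.status)
echo "charge=$charge status=$status"
```

On the live system the same helper could be fed with upsc APC@localhost:3493 | ups_var battery.charge. upsc also accepts a variable name as a second argument, which avoids the parsing entirely.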

Server configuration

Configure two users: an administrator, and one for the VM (remote upsmon)

root@pve00:~# cat /etc/nut/upsd.conf
LISTEN 0.0.0.0 3493

root@pve00:~# cat /etc/nut/upsd.users
[admin]
password = admin
actions = SET FSD
instcmds = ALL
upsmon master

[user]
password = normal
upsmon slave

Permissions recommended by the official documentation

chown root:nut upsd.conf upsd.users 
chmod 0640 upsd.conf upsd.users
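
To confirm the mode took effect, a small check along these lines may help. It is only a sketch: check_mode is a hypothetical helper, it relies on GNU stat -c (as shipped with Debian/PVE), and the demo uses a temporary file so it runs anywhere.

```shell
#!/bin/sh
# check_mode FILE MODE: succeed only if FILE has exactly the octal MODE.
# Hypothetical helper; relies on GNU `stat -c`.
check_mode() {
    [ "$(stat -c '%a' "$1")" = "$2" ]
}

# Demo on a temporary file instead of the real /etc/nut files.
demo=$(mktemp)
chmod 0640 "$demo"
if check_mode "$demo" 640; then
    result=ok
else
    result=bad
fi
echo "$demo: $result"
rm -f "$demo"
```

On the NUT host the call would be check_mode /etc/nut/upsd.users 640, and likewise for upsd.conf.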

You can run upsd by hand as a test; in practice, systemctl restart nut-server is enough to start it

root@pve00:/etc/nut# upsd
Network UPS Tools upsd 2.7.4
fopen /run/nut/upsd.pid: No such file or directory
listening on 0.0.0.0 port 3493
Connected to UPS [APC]: usbhid-ups-APC


Shutdown sequence

First, look at the required shutdown sequence:
upsmon secondary systems see “FSD” and:

  1. generate a NOTIFY_SHUTDOWN event
  2. wait FINALDELAY seconds (typically 5)
  3. call SHUTDOWNCMD
  4. disconnect from upsd

Next comes upsmon on the server. It first waits up to HOSTSYNC seconds; if connections remain after that time, it ignores them and continues shutting down.

The upsmon primary:

  1. generates a NOTIFY_SHUTDOWN event
  2. waits FINALDELAY seconds (typically 5)
  3. creates the POWERDOWNFLAG file in its local filesystem
  4. calls the SHUTDOWNCMD

Conclusion: the total shutdown time of the VMs must stay below the primary's HOSTSYNC + FINALDELAY
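
As a quick sanity check of that rule, the arithmetic can be sketched as below. The values 30 and 10 are the HOSTSYNC and FINALDELAY described in this article; the 25-second VM shutdown time is an assumption for illustration and should be replaced with a measured value. budget_ok is a hypothetical helper name.

```shell
#!/bin/sh
# budget_ok VM_SECONDS HOSTSYNC FINALDELAY
# Succeeds when the VM shutdown time is below HOSTSYNC + FINALDELAY.
budget_ok() {
    [ "$1" -lt $(( $2 + $3 )) ]
}

HOSTSYNC=30      # seconds the primary waits for secondaries (from this article)
FINALDELAY=10    # extra wait before SHUTDOWNCMD (from this article)
VM_SECONDS=25    # assumed worst-case VM shutdown time; measure your own

if budget_ok "$VM_SECONDS" "$HOSTSYNC" "$FINALDELAY"; then
    verdict="fits"
else
    verdict="too slow"
fi
echo "VM shutdown ${VM_SECONDS}s vs budget $((HOSTSYNC + FINALDELAY))s: $verdict"
```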

Configuring automatic shutdown (upsmon)

https://networkupstools.org/docs/man/upsmon.conf.html

First, the configuration on PVE

Wait 30 seconds for the VM to shut down, then wait another 10 seconds before shutting down

root@pve00:~# cat /etc/nut/upsmon.conf |grep -v "^#"

#### Modified/added parameters ####

MONITOR APC@localhost:3493 1 admin admin primary

FINALDELAY 10

HOSTSYNC 30

NOTIFYFLAG ONLINE SYSLOG+WALL+EXEC
NOTIFYFLAG ONBATT SYSLOG+WALL+EXEC
NOTIFYFLAG LOWBATT SYSLOG+WALL+EXEC
NOTIFYFLAG FSD SYSLOG+WALL+EXEC

NOTIFYCMD /usr/sbin/upssched
# NOTIFYFLAG COMMOK SYSLOG+WALL
# NOTIFYFLAG COMMBAD SYSLOG+WALL
# NOTIFYFLAG SHUTDOWN SYSLOG+WALL
# NOTIFYFLAG REPLBATT SYSLOG+WALL
# NOTIFYFLAG NOCOMM SYSLOG+WALL
# NOTIFYFLAG NOPARENT SYSLOG+WALL

#### These can keep their default values #####

MINSUPPLIES 1

SHUTDOWNCMD "/sbin/shutdown -h +0"

POLLFREQ 5

POLLFREQALERT 5

DEADTIME 15

POWERDOWNFLAG /etc/killpower

RBWARNTIME 43200

NOCOMMWARNTIME 300

upsmon -> calls upssched -> calls your CMDSCRIPT

root@pve00:/etc/nut# cat upssched.conf |grep -v "^#"

CMDSCRIPT /bin/upssched-cmd

PIPEFN /run/nut/upssched/upssched.pipe

LOCKFN /run/nut/upssched/upssched.lock

##### Shut down if power has not returned within 90 seconds #####
AT ONBATT * START-TIMER power-off 90
AT LOWBATT * START-TIMER power-off 10

##### Power restored #####
AT ONLINE * CANCEL-TIMER power-off

root@pve00:/etc/nut# cat /bin/upssched-cmd |grep -v "^#"

#! /bin/sh
case $1 in
power-off)
logger -t upssched-cmd "Power OFF right now"
/usr/sbin/upsmon -c fsd
;;
*)
logger -t upssched-cmd "Unrecognized command: $1"
;;
esac
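
Because the real script triggers a forced shutdown, it can be useful to exercise the case dispatch in a dry run first. The sketch below mirrors the structure of the script above but only prints what would happen; handle_timer is a hypothetical name, and the timer name bogus is made up to hit the fallback branch.

```shell
#!/bin/sh
# Dry-run variant of the CMDSCRIPT dispatch: prints the action for a timer
# name instead of calling logger/upsmon, so it is safe to run anywhere.
handle_timer() {
    case $1 in
        power-off)
            echo "would run: /usr/sbin/upsmon -c fsd"
            ;;
        *)
            echo "unrecognized command: $1"
            ;;
    esac
}

first=$(handle_timer power-off)
second=$(handle_timer bogus)
echo "$first"
echo "$second"
```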

systemctl enable nut-monitor

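Next, the configuration on TrueNAS
Web UI settings: fill in the UPS and user settings with the values configured on the primary node, select SLAVE, and set shutdown to 60 seconds after switching to battery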

Testing

  1. Running upsmon -c fsd manually: both PVE and TrueNAS shut down
  2. Cutting power: achieves the desired result

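Summary

Debugging showed that a complicated configuration is not actually needed; relying on the system services is largely sufficient
Final behavior:

  1. Mains power is cut and ONBATT is received
  2. TrueNAS runs its shutdown after 60 seconds
  3. PVE runs upsmon -c fsd after 90 seconds, which goes through the normal shutdown sequence
  4. Shutdown

With this setup only the NAS itself shuts down; because the motherboard has IPMI, it still draws some power. I will look into how to cut the UPS output directly later

Also, no VM shutdown script needs to be configured:
pve-manager.service has a stop task configured: ExecStop=/usr/bin/pvesh --nooutput create /nodes/localhost/stopall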

TODO

  1. Low-battery threshold
  2. Do not restart within N seconds of a power outage
  3. Send email
