kernel default route inactive when installed after FRR 7.5 starts
Hi,

I'm upgrading a bunch of Linux routers from CentOS 7 to Rocky 8, and as part of the upgrade, quagga seems to have been replaced by frr. For the most part, everything works fine, but I've encountered one problem. I've got a router that picks up a default route via DHCP from a cable modem. With quagga, this default route was accepted and redistributed via OSPF. But with FRR, it sometimes says that the route is "inactive", which horks my routing.

I built the Fedora 34 quagga package and ran that and saw these results using quagga-1.2.4-17.el8.x86_64:

    Hello, this is Quagga (version 1.2.4).
    Copyright 1996-2005 Kunihiro Ishiguro, et al.

    ti14# show ip route 0.0.0.0/0
    Routing entry for 0.0.0.0/0
      Known via "ospf", distance 110, metric 103, tag 0, vrf 0
      Last update 00:00:13 ago
        192.168.39.5, via lan0.9

    Routing entry for 0.0.0.0/0
      Known via "kernel", distance 0, metric 0, tag 0, vrf 0, best, fib
      * 207.237.112.1, via lan1

But with the standard frr-7.5-4.el8.x86_64.rpm, it sometimes marks the kernel route as inactive when it starts, and uses the ospf route instead:

    Hello, this is FRRouting (version 7.5).
    Copyright 1996-2005 Kunihiro Ishiguro, et al.

    ti14# show ip route 0.0.0.0/0
    Routing entry for 0.0.0.0/0
      Known via "ospf", distance 110, metric 103, best
      Last update 00:04:42 ago
      * 192.168.39.5, via lan0.9, weight 1

    Routing entry for 0.0.0.0/0
      Known via "kernel", distance 0, metric 0
      Last update 00:05:42 ago
      * 207.237.112.1, via lan1 inactive

When it's working properly, typically after a restart, I see:

    Hello, this is FRRouting (version 7.5).
    Copyright 1996-2005 Kunihiro Ishiguro, et al.

    ti14# show ip route 0.0.0.0/0
    Routing entry for 0.0.0.0/0
      Known via "ospf", distance 110, metric 103
      Last update 00:00:01 ago
        192.168.39.5, via lan0.9, weight 1

    Routing entry for 0.0.0.0/0
      Known via "kernel", distance 0, metric 0, best
      Last update 00:00:08 ago
      * 207.237.112.1, via lan1

My best guess is that there's some kind of timing issue here. When the system boots up with FRR, the FRR daemons start before DHCP installs the default route. That seems to lead to its being marked inactive. If I then restart FRR, it accepts the kernel default route.

Is this perhaps fixed in a newer version of FRR? Or am I doing something stupid? Is there a patch for this? If not, I'm going to need to revert to quagga.

Thanks,
Andy
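Editorial sketch: the failing and working cases above differ only in which entry carries `best` and whether the kernel nexthop is tagged `inactive`. A small parser can make that check repeatable across reboots. It assumes the text layout shown in this message, which is not a stable interface; the function names `parse_route_entries` and `inactive_kernel_routes` are hypothetical.

```python
import re

# Sample copied from the failing FRR 7.5 output quoted in this thread.
SAMPLE = """\
Routing entry for 0.0.0.0/0
  Known via "ospf", distance 110, metric 103, best
  Last update 00:04:42 ago
  * 192.168.39.5, via lan0.9, weight 1

Routing entry for 0.0.0.0/0
  Known via "kernel", distance 0, metric 0
  Last update 00:05:42 ago
  * 207.237.112.1, via lan1 inactive
"""


def parse_route_entries(text):
    """Split 'show ip route <prefix>' output into one dict per routing entry."""
    entries = []
    # Every entry starts with the literal header 'Routing entry for <prefix>'.
    for chunk in re.split(r"(?=Routing entry for )", text):
        header = re.match(r"Routing entry for (\S+)", chunk)
        if not header:
            continue  # text before the first entry (banner, prompt, blank)
        proto = re.search(r'Known via "([^"]+)"', chunk)
        entries.append({
            "prefix": header.group(1),
            "protocol": proto.group(1) if proto else None,
            # 'best' appears as a comma-separated flag on the 'Known via' line.
            "best": re.search(r",\s*best\b", chunk) is not None,
            # 'inactive' trails the nexthop line in the failing case.
            "inactive": re.search(r"\binactive\b", chunk) is not None,
        })
    return entries


def inactive_kernel_routes(text):
    """Return the kernel-sourced entries whose nexthop is flagged inactive."""
    return [e for e in parse_route_entries(text)
            if e["protocol"] == "kernel" and e["inactive"]]


if __name__ == "__main__":
    for entry in inactive_kernel_routes(SAMPLE):
        print("inactive kernel route:", entry["prefix"])
```

Feeding it the working output instead should yield an empty list, since no nexthop there is marked `inactive`.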
Can you add `debug zebra rib detail` to the top of your log file and recreate this issue? We should have special code that always allows the kernel route received over netlink. I would be interested in understanding what is going wrong.

donald
_______________________________________________ frog mailing list frog@lists.frrouting.org https://lists.frrouting.org/listinfo/frog
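Editorial sketch: while collecting the debug log, it may help to record what the kernel itself holds, so the zebra view can be compared against it. The snippet below reads IPv4 default routes from text in the `/proc/net/route` format. It assumes a little-endian host such as x86_64 (the file prints addresses as host-order hex). The sample table is illustrative: the gateway and interface names come from this thread, but the 192.168.39.0/24 line is an assumption, and `default_routes` is a hypothetical helper name.

```python
import socket
import struct

RTF_UP = 0x0001       # route is usable
RTF_GATEWAY = 0x0002  # destination is reached through a gateway

# Illustrative table in /proc/net/route layout (tab-separated, hex fields).
SAMPLE = (
    "Iface\tDestination\tGateway\tFlags\tRefCnt\tUse\tMetric\tMask\tMTU\tWindow\tIRTT\n"
    "lan1\t00000000\t0170EDCF\t0003\t0\t0\t0\t00000000\t0\t0\t0\n"
    "lan0.9\t0027A8C0\t00000000\t0001\t0\t0\t0\t00FFFFFF\t0\t0\t0\n"
)


def default_routes(route_table_text):
    """Return (interface, gateway) pairs for IPv4 default routes."""
    found = []
    for line in route_table_text.splitlines()[1:]:  # first line is the header
        fields = line.split()
        if len(fields) < 8:
            continue
        iface, dest, gateway_hex = fields[0], fields[1], fields[2]
        flags, mask = int(fields[3], 16), fields[7]
        # A default route has destination 0.0.0.0 and netmask 0.0.0.0.
        if dest != "00000000" or mask != "00000000":
            continue
        if not (flags & RTF_UP and flags & RTF_GATEWAY):
            continue
        # Assumption: little-endian host, so the hex value is byte-swapped.
        gateway = socket.inet_ntoa(struct.pack("<L", int(gateway_hex, 16)))
        found.append((iface, gateway))
    return found


if __name__ == "__main__":
    for iface, gateway in default_routes(SAMPLE):
        print(f"kernel default route via {gateway} on {iface}")
```

On a live router the same function could be given the contents of `/proc/net/route`; an empty result right after boot, followed by a non-empty one once the DHCP lease arrives, would be consistent with the timing theory in the original report.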
Hi,

I have attached two gzipped log files. The first shows the buggy case where the system boots up and FRR starts before DHCP acquires the default address. This gives an idea of the timing:

    [root@ti14 frr]# journalctl -b | egrep -i 'frr|dhcp'
    Nov 14 10:32:55 ti14 systemd[1]: Starting FRRouting...
    Nov 14 10:32:56 ti14 watchfrr[904]: watchfrr 7.5 starting: vty@0
    Nov 14 10:32:56 ti14 watchfrr[904]: zebra state -> down : initial connection attempt failed
    Nov 14 10:32:56 ti14 watchfrr[904]: ospfd state -> down : initial connection attempt failed
    Nov 14 10:32:56 ti14 watchfrr[904]: staticd state -> down : initial connection attempt failed
    Nov 14 10:32:56 ti14 watchfrr[904]: Forked background command [pid 905]: /usr/lib/frr/watchfrr.sh restart all
    Nov 14 10:32:56 ti14 watchfrr.sh[913]: Cannot stop staticd: pid file not found
    Nov 14 10:32:56 ti14 watchfrr.sh[915]: Cannot stop ospfd: pid file not found
    Nov 14 10:32:56 ti14 watchfrr.sh[917]: Cannot stop zebra: pid file not found
    Nov 14 10:32:56 ti14 watchfrr[904]: zebra state -> up : connect succeeded
    Nov 14 10:32:56 ti14 watchfrr[904]: ospfd state -> up : connect succeeded
    Nov 14 10:32:56 ti14 watchfrr[904]: staticd state -> up : connect succeeded
    Nov 14 10:32:56 ti14 watchfrr[904]: all daemons up, doing startup-complete notify
    Nov 14 10:32:56 ti14 frrinit.sh[707]: Started watchfrr
    Nov 14 10:32:56 ti14 systemd[1]: Started FRRouting.
    Nov 14 10:33:00 ti14 dhclient[1823]: DHCPREQUEST on lan1 to 255.255.255.255 port 67 (xid=0x7fabd46c)
    Nov 14 10:33:07 ti14 dhclient[1823]: DHCPREQUEST on lan1 to 255.255.255.255 port 67 (xid=0x7fabd46c)
    Nov 14 10:33:21 ti14 dhclient[1823]: DHCPDISCOVER on lan1 to 255.255.255.255 port 67 interval 8 (xid=0xe1a98663)
    Nov 14 10:33:29 ti14 dhclient[1823]: DHCPDISCOVER on lan1 to 255.255.255.255 port 67 interval 10 (xid=0xe1a98663)
    Nov 14 10:33:39 ti14 dhclient[1823]: DHCPDISCOVER on lan1 to 255.255.255.255 port 67 interval 19 (xid=0xe1a98663)
    Nov 14 10:33:39 ti14 dhclient[1823]: DHCPREQUEST on lan1 to 255.255.255.255 port 67 (xid=0xe1a98663)
    Nov 14 10:33:39 ti14 dhclient[1823]: DHCPOFFER from 10.22.200.1
    Nov 14 10:33:39 ti14 dhclient[1823]: DHCPACK from 10.22.200.1 (xid=0xe1a98663)

The second logfile was from an FRR restart where the default kernel route was installed prior to FRR's startup. In that case, everything works properly.

Would it be better to open a bug for this issue? Perhaps it's fixed in newer code. I tried the frr-8.1-02.el8.x86_64.rpm from your repo, but it frankly didn't work at all -- the ospf config was somehow not loaded. I didn't spend any time investigating; I guess there must be some major changes in the configuration language. It didn't seem worth much effort in view of the fact that quagga works properly.

Regards,
Andy

--
Andrew Schorr                      e-mail: aschorr@telemetry-investments.com
Telemetry Investments, L.L.C.      phone: 917-305-1748
152 W 36th St, #402                fax: 212-425-5550
New York, NY 10018-8765
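Editorial sketch: the journal excerpt above can be reduced to a single number, the delay between FRR reporting itself started and the DHCP lease being acknowledged. The two log lines are copied from the message; the helper names `syslog_time` and `first_match` are hypothetical, and the year is an assumption because the journal's short timestamp format omits it.

```python
from datetime import datetime

# Two lines copied from the journalctl excerpt in this thread.
LINES = [
    "Nov 14 10:32:56 ti14 systemd[1]: Started FRRouting.",
    "Nov 14 10:33:39 ti14 dhclient[1823]: DHCPACK from 10.22.200.1 (xid=0xe1a98663)",
]


def syslog_time(line, year=2021):
    """Parse the leading 'Mon DD HH:MM:SS' stamp; the year is supplied by the caller."""
    stamp = " ".join(line.split()[:3])
    # %b matches English month abbreviations under the default C locale.
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def first_match(lines, needle):
    """Timestamp of the first line containing needle, or None if absent."""
    for line in lines:
        if needle in line:
            return syslog_time(line)
    return None


frr_started = first_match(LINES, "Started FRRouting")
lease_acked = first_match(LINES, "DHCPACK")
gap_seconds = (lease_acked - frr_started).total_seconds()

if __name__ == "__main__":
    print(f"DHCP lease arrived {gap_seconds:.0f} s after FRR finished starting")
```

For the boot shown here the gap is 43 seconds, so any kernel default route installed from that lease necessarily reached zebra well after startup.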
Participants (2): Andrew J. Schorr, Donald Sharp