Posted: Mon Oct 16, 2023 20:00 Post subject: [Resolved] Network volatility, network spikes
Hello,
I've faced a number of issues lately with my Netgear R7800 router, now running r53445 (details of setup at bottom).
Some things I think I've figured out, others not. Any feedback is much appreciated.
1) I had a lot of wireless problems with all the clients being dropped at once and not being able to reconnect. It would happen about every 12-24 hours. Restarting hostapd would fix it most of the time, but not always (sometimes it wouldn't restart after being stopped).
Code:
Aug 30 10:39:58 ddwrt_main daemon.info hostapd: wlan0.3: STA 8e:ab:3d:a6:29:28 IEEE 802.11: disassociated due to inactivity
Aug 30 10:39:59 ddwrt_main daemon.info hostapd: wlan0.3: STA 8e:ab:3d:a6:29:28 IEEE 802.11: deauthenticated due to inactivity (timer DEAUTH/REMOVE)
Aug 30 10:40:21 ddwrt_main daemon.info hostapd: wlan0.2: STA 14:59:c0:d5:dc:33 IEEE 802.11: disassociated due to inactivity
Aug 30 10:40:22 ddwrt_main daemon.info hostapd: wlan0.2: STA 14:59:c0:d5:dc:33 IEEE 802.11: deauthenticated due to inactivity (timer DEAUTH/REMOVE)
Aug 30 10:40:31 ddwrt_main daemon.info hostapd: wlan0.3: STA b0:72:bf:ea:7d:89 IEEE 802.11: disassociated due to inactivity
Aug 30 10:40:32 ddwrt_main daemon.info hostapd: wlan0.3: STA b0:72:bf:ea:7d:89 IEEE 802.11: deauthenticated due to inactivity (timer DEAUTH/REMOVE)
Aug 30 10:40:34 ddwrt_main daemon.info hostapd: wlan1.2: STA 14:59:c0:d5:dc:33 IEEE 802.11: deauthenticated due to inactivity (timer DEAUTH/REMOVE)
Aug 30 10:40:35 ddwrt_main daemon.info hostapd: wlan0.3: STA 74:e2:0c:46:fe:91 IEEE 802.11: disassociated due to inactivity
Aug 30 10:40:36 ddwrt_main daemon.info hostapd: wlan0.3: STA 74:e2:0c:46:fe:91 IEEE 802.11: deauthenticated due to inactivity (timer DEAUTH/REMOVE)
Aug 30 10:40:44 ddwrt_main daemon.info hostapd: wlan0.1: STA 4a:ff:7b:6a:24:e1 IEEE 802.11: disassociated due to inactivity
Aug 30 10:40:45 ddwrt_main daemon.info hostapd: wlan0.1: STA 4a:ff:7b:6a:24:e1 IEEE 802.11: deauthenticated due to inactivity (timer DEAUTH/REMOVE)
Aug 30 10:41:29 ddwrt_main daemon.info hostapd: wlan0: STA b2:f6:f5:11:76:e6 IEEE 802.11: disassociated due to inactivity
Aug 30 10:41:30 ddwrt_main daemon.info hostapd: wlan0: STA b2:f6:f5:11:76:e6 IEEE 802.11: deauthenticated due to inactivity (timer DEAUTH/REMOVE)
The problems seemed to get better after I added this to the startup commands (I also switched from channel 9 to 11 for the 2.4G radio):
Code:
## possible fix for wifi bulk-deauth and then wifi invisible?
# don't disassociate bad connections, and only query for inactive clients after 1 day of inactivity
sed -i 's/disassoc_low_ack=1/disassoc_low_ack=0\nap_max_inactivity=86400/' /tmp/wlan0_hostap.conf
sed -i 's/disassoc_low_ack=1/disassoc_low_ack=0\nap_max_inactivity=86400/' /tmp/wlan1_hostap.conf
killall hostapd
sleep 2
hostapd -B -P /var/run/wlan0_hostapd.pid /tmp/wlan0_hostap.conf
hostapd -B -P /var/run/wlan1_hostapd.pid /tmp/wlan1_hostap.conf
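A slightly more defensive variant of the config edit above, as a sketch only. It assumes the same /tmp/wlanN_hostap.conf paths and a literal disassoc_low_ack=1 line, as in the snippet. The helper name patch_hostapd_conf is made up for illustration. Unlike the one-line sed, it can run more than once without adding ap_max_inactivity twice.

```shell
# Hypothetical helper: patch one hostapd config file in place.
# Safe to call repeatedly; returns non-zero if the file is missing.
patch_hostapd_conf() {
    conf="$1"
    [ -f "$conf" ] || return 1
    # Stop hostapd from dropping clients because of low ACK rates.
    sed -i 's/^disassoc_low_ack=1$/disassoc_low_ack=0/' "$conf"
    # Add the inactivity limit only if no such line exists yet.
    grep -q '^ap_max_inactivity=' "$conf" || echo 'ap_max_inactivity=86400' >> "$conf"
}

# Example (paths as in the post above):
#   for c in /tmp/wlan0_hostap.conf /tmp/wlan1_hostap.conf; do patch_hostapd_conf "$c"; done
```

hostapd would still need restarting afterwards, as in the original startup commands.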
Ultimately I decided to quit spending so much effort on the wifi and have now got a proprietary WAP, so I intend to use dd-wrt as the wired-only gateway/router.
2) I have a longstanding issue with network volatility that I haven't been able to get to the bottom of. There would be freezes in skype/teams calls, and big, short drops in throughput during speed tests. Two clear metrics I thought I could check into were responsiveness of the router, and looking at bandwidth usage.
2a) PING test:
Code:
$ ping -s 1024 192.168.9.1 | tee netstability.log
PING 192.168.9.1 (192.168.9.1) 1024(1052) bytes of data.
1032 bytes from 192.168.9.1: icmp_seq=1 ttl=64 time=1.02 ms
1032 bytes from 192.168.9.1: icmp_seq=2 ttl=64 time=0.479 ms
1032 bytes from 192.168.9.1: icmp_seq=3 ttl=64 time=0.487 ms
...
Then I plot the response times. At first, they were looking like this:
Some response times were over 15 seconds, and often over 2 seconds! Then I disabled avahi / reflector and upnp repeater (upnprd, which just caches upnp announces), and did what I said about wifi in step/question (1) and things were looking better (after reboots):
Though I can't rule out other changes in the network environment or other things going on unfortunately.
Now, without mdns/avahi, without upnprd, and without wifi, it's mostly better. But now with uptime nearing 1 day, about once an hour (~4000 seconds) I'm getting delays peaking around 12 seconds for about 15 pings. (In subsequent testing, these became less regular but continued). Otherwise it looks pretty reasonable (although I don't really get why it would ever approach or exceed 500ms, and it still does). I've been able to observe top when this spike happens but unfortunately top does not update, or the ssh disconnects (!) at that moment, so it was not informative.
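For anyone wanting to pull the spikes out of a log like netstability.log without plotting it, here is a minimal sketch. It assumes the Linux ping reply format shown above (icmp_seq=N ... time=X ms). The function name ping_spikes and the 500 ms threshold in the example are arbitrary choices. Replies that never arrive leave no line in the log, so this only catches slow replies, not lost ones.

```shell
# Hypothetical helper: print "seq time_ms" for every reply slower than LIMIT_MS.
# Usage: ping_spikes LOGFILE LIMIT_MS
ping_spikes() {
    awk -v limit="$2" '
        /time=/ {
            seq = ""; t = ""
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^icmp_seq=/) seq = substr($i, 10)   # text after "icmp_seq="
                if ($i ~ /^time=/)     t   = substr($i, 6)    # text after "time="
            }
            if (t + 0 > limit + 0) print seq, t
        }
    ' "$1"
}

# Example:
#   ping_spikes netstability.log 500
```

Since ping sends one request per second by default, the sequence numbers give a rough idea of how far apart the spikes are.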
(to be continued next message)
Last edited by jtbr on Fri Mar 22, 2024 17:25; edited 6 times in total
2b) I'm getting inexplicable bandwidth spikes, like a network storm shown on the bandwidth monitor:
Very brief spikes above 10Gbps happen about every 5 or 10 minutes once the router has been up a little while. It seems to affect all the bridges, but separately (not at the same time, although they are mostly connected locally):
I have watched iftop on eth1 while the bandwidth monitor showed such a spike on eth1, but did not notice anything out of the ordinary. I'm not seeing signs of this elsewhere! What could it be? The numbers are implausible: this is a 1Gbps router. But anything approaching this could explain some momentary sluggishness for sure.
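One way to cross-check the graph independently is to sample the kernel's byte counters directly. The sketch below assumes the standard Linux sysfs counters under /sys/class/net. How the DD-WRT bandwidth monitor computes its values is not known here, so this is only a second opinion, not an explanation of the spikes. The function names are made up.

```shell
# Hypothetical helper: bits per second from two byte-counter samples.
# Prints "reset" when the counter went backwards instead of subtracting blindly.
rate_bps() {
    prev="$1"; cur="$2"; secs="$3"
    if [ "$cur" -lt "$prev" ]; then
        echo reset
    else
        echo $(( (cur - prev) * 8 / secs ))
    fi
}

# Sample one interface once per second (Ctrl-C to stop).
watch_iface() {
    f="/sys/class/net/$1/statistics/rx_bytes"
    last=$(cat "$f")
    while sleep 1; do
        now=$(cat "$f")
        echo "$(date +%T) $1 rx $(rate_bps "$last" "$now" 1) bit/s"
        last=$now
    done
}

# Example:
#   watch_iface eth1
```

If this log stays flat while the monitor shows a spike, that would point at the graphing rather than at real traffic.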
3) Bugs noted:
3a) In debugging volatility, I noticed I seemed to be getting duplicate DHCP messages:
Code:
Oct 16 08:19:22 ddwrt_main daemon.info dnsmasq-dhcp[15811]: DHCPINFORM(br0) 192.168.9.238 14:f6:d8:83:b0:d5
Oct 16 08:19:22 ddwrt_main daemon.info dnsmasq-dhcp[15811]: DHCPACK(br0) 192.168.9.238 14:f6:d8:83:b0:d5 L201582949
Oct 16 08:19:22 ddwrt_main daemon.info dnsmasq-dhcp[15811]: DHCPINFORM(br0) 192.168.9.238 14:f6:d8:83:b0:d5
Oct 16 08:19:22 ddwrt_main daemon.info dnsmasq-dhcp[15811]: DHCPACK(br0) 192.168.9.238 14:f6:d8:83:b0:d5 L201582949
It appears this is a bug where DD-WRT is putting bridges into the first line of the dnsmasq config file TWICE, so DHCP is listening, and responding, twice:
Code:
interface=br0,br1,br2,br3,br1,br2,br3
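A minimal sketch of removing the repeats from that line. The helper names are hypothetical, and it assumes the config has a single interface= line like the one shown. The file location and the need to restart dnsmasq after editing depend on the setup; the file is presumably regenerated whenever the service is restarted by the firmware, so a manual edit may not persist.

```shell
# Hypothetical helper: drop repeated names from a comma-separated list,
# keeping the first occurrence of each.
dedupe_ifaces() {
    printf '%s\n' "$1" | awk -F, '{
        out = ""
        for (i = 1; i <= NF; i++) {
            if (!seen[$i]++) {
                sep = (out == "" ? "" : ",")
                out = out sep $i
            }
        }
        print out
    }'
}

# Rewrite the interface= line of a dnsmasq config file in place.
# Assumes there is only one such line.
dedupe_dnsmasq_conf() {
    conf="$1"
    old=$(sed -n 's/^interface=//p' "$conf" | head -n 1)
    [ -n "$old" ] || return 0
    sed -i "s/^interface=.*/interface=$(dedupe_ifaces "$old")/" "$conf"
}
```

With the line shown above, dedupe_ifaces would return br0,br1,br2,br3.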
3b) Quite minor, but in Setup->Networking, setting the TX Queue Length of bridges seems to have no effect. Setting them via command works:
Code:
ip link set qlen 2000 dev br0 up
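To confirm whether a setting actually took effect, the current queue length can be read back from sysfs. This is a sketch assuming the usual Linux location /sys/class/net/IFACE/tx_queue_len. The helper name check_qlen is made up, and the root directory is a parameter only so the check can be tried against a test directory.

```shell
# Hypothetical helper: report interfaces whose tx queue length differs from WANT.
# Usage: check_qlen ROOT WANT IFACE...
check_qlen() {
    root="$1"; want="$2"; shift 2
    for ifc in "$@"; do
        have=$(cat "$root/$ifc/tx_queue_len" 2>/dev/null)
        [ "$have" = "$want" ] || echo "$ifc: tx_queue_len=${have:-unknown}, expected $want"
    done
}

# Example:
#   check_qlen /sys/class/net 2000 br0 br1 br2 br3
```

No output means every listed interface already has the wanted value.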
3c) I'm still not sure what causes a problem where clients on the direct internet access bridges (the ones that don't use VPN) lose their ability to access the internet. In this scenario, clients on the vpn-based bridges can still access the internet, as can the router. My custom routes are all as they should be. But the route to the DSL modem is missing from the routing table (ip route default gw 192.168.178.1 dev eth0), even though the ppp0 route (across which all connections to the internet should traverse) is there. It seems to be some weird timing issue where eth0 is not obtaining its IP address via DHCP, yet ppp0 still passes traffic just fine. Restarting pppd seems to correct the problem in most cases, but only after a sometimes long delay. And I don't know the cause; perhaps it was because I was having to restart hostapd manually?
I realize that this is a lot, thanks for reading. Again, appreciate any help.
--- My setup ---
ddwrt_main R7800 used as a PPPoE gateway (r53445) with QoS for WAN (SFE disabled). It has 4 bridges: one bridge is direct traffic to the internet, one for IoT -- also direct but with extra safeguards. The other two have all external traffic routed via two separate openvpn client connections. Each bridge has its own subnet, vlan tag, and wireless ssid. With the exception of IoT, all the bridges/subnets have routes to each other. There are 2 openvpn clients (via script), 1 openvpn server (via gui), smartdns (via script)+dnsmasq (via GUI) for dhcp/local dns. My network is as follows:
Code:
DSL Router
|
ddwrt_main GATEWAY DD-WRT WAP
Netgear R7800 --- Managed switch --- Netgear AC1450 --- My normal work location
| (tagged) Wifi extension wired or wireless
(tagged) w/ virtual IFs
|
New WAP
Last edited by jtbr on Tue Oct 17, 2023 20:05; edited 1 time in total
I realize that irqbalance is broken on this build, so potentially some instability could be due to poor allocation of processors. So here are some data on that:
Curious if you are getting duplicate bridges whenever you type #brctl show.
I also get duplicates in my dnsmasq.conf but not in brctl. Weird. I am not using the WAN however and it's disabled. It's a wireless WAP/VAP and switch w/4 bridges. The irqbalance issue doesn't seem to affect mine much. This router is used mostly for high bandwidth multi-media devices and works well (I can run one 4K stream and two 1080P streams simultaneously with zero buffering), but your case seems entirely different. Impressive research BTW.
_________________
Linksys EA8500 (Internet Gateway, AP/VAP) - DD-WRT r53562
Features in use: WDS-AP, Multiple VLANs, Samba, WireGuard, Entware: mqtt, mlocate
Wireless 5ghz only
Netgear R7800 (WDS-AP, WAP, VAP) - DD-WRT r55779
Features in use: multiple VLANs over single trunk port
Linksys EA8500 WDS Station x2 - DD-WRT r55799
Netgear R6400v2 WAP, VAP 2.4ghz only w/VLANs over single trunk port. DD-WRT r55779
OSes: Fedora 38, 9 RPis (2,3,4,5), 20 ESP8266s: Straight from Amiga to Linux in '94, never having owned a Windows PC.
Lexridge, thanks for your reply. Yeah I also doubt this has to do with irqbalance or processor affinities, but wanted to put that out there just in case.
brctl show looks normal to me:
Code:
bridge name     bridge id               STP enabled     interfaces
br0             8000.3894ed1578ab       no              eth1
                                                        vlan1
                                                        vlan11
br1             8000.3894ed1578ab       yes             vlan12
br2             8000.3894ed1578ab       yes             vlan14
br3             8000.3894ed1578ab       yes             vlan13
This is now that I've turned off the wifi. It used to show the wireless interfaces too (wlan0, wlan0.1, etc.).
Posted: Tue Oct 17, 2023 11:44 Post subject:
The disconnecting client problem with the R7800 is nothing new... in my case those messages show up in the logs too, but my clients are actually fine; they don't lose connections or have connectivity issues.
Do you use Vanilla or DDWRT drivers?
To help you more we need to know all your wifi settings...
Those clients that drop and never reconnect... it could be on their side too.
Also client names and static IPs do matter too.
I tend to set the Lease Expiration time to 360, or in a busy environment even lower: 240, 120 or even 60. Imagine a train station or bus station where tons of new clients reserve every address in the DHCP table, so any newer clients won't be given IPs because they are all booked already.
Also brctl show... brctl has always had trouble deleting old bridges, so sometimes those need manual removal via the CLI.
Apart from that, cat /tmp/dnsmasq.conf will always show you duplicate entries. It's nothing new, and it does no harm, as far as I was told.
The nvram command below will show you the hardware interfaces and their MACs. I've found you may have some leftovers there too: old interfaces that you created and deleted but that still exist and carry on (bridges such as br0 are interfaces and are given a status).
But be wise and do not remove interfaces that do exist, as you may get into trouble.
nvram show | grep hwaddr - to check..
nvram unset br2_hwaddr - to remove..
nvram commit - to commit changes..
reboot - to reboot the system..
irqbalance is fixed and available on builds after 10-14-2023-r53633. I'm on 53662 and can confirm irqbalance is present. A new build will follow soon, as those builds are on k6.1 and there will be lots of WIP.
Apart from that, you can always set CPU affinity manually; just add those commands and save them to startup.
_________________
Atheros
TP-Link WR740Nv1 ---DD-WRT 55630 WAP
TP-Link WR1043NDv2 -DD-WRT 55723 Gateway/DoT,Forced DNS,Ad-Block,Firewall,x4VLAN,VPN
TP-Link WR1043NDv2 -Gargoyle OS 1.15.x AP,DNS,QoS,Quotas
Qualcomm-Atheros
Netgear XR500 --DD-WRT 55779 Gateway/DoH,Forced DNS,AP Isolation,4VLAN,Ad-Block,Firewall,Vanilla
Netgear R7800 --DD-WRT 55819 Gateway/DoT,AD-Block,Forced DNS,AP&Net Isolation,x3VLAN,Firewall,Vanilla
Netgear R9000 --DD-WRT 55779 Gateway/DoT,AD-Block,AP Isolation,Firewall,Forced DNS,x2VLAN,Vanilla
Broadcom
Netgear R7000 --DD-WRT 55460 Gateway/SmartDNS/DoH,AD-Block,Firewall,Forced DNS,x3VLAN,VPN
NOT USING 5Ghz ANYWHERE
------------------------------------------------------
Stubby DNS over TLS I DNSCrypt v2 by mac913
Posted: Tue Oct 17, 2023 12:00 Post subject: Re: Network volatility, network spikes
jtbr wrote:
The problems seemed to get better after I added this to the startup commands (I also disabled WMM for the 2.4g radio and switched from channel 9 to 11):
Code:
## possible fix for wifi bulk-deauth and then wifi invisible?
# don't disassociate bad connections, and only query for inactive clients after 1 day of inactivity
sed -i 's/disassoc_low_ack=1/disassoc_low_ack=0\nap_max_inactivity=86400/' /tmp/wlan0_hostap.conf
sed -i 's/disassoc_low_ack=1/disassoc_low_ack=0\nap_max_inactivity=86400/' /tmp/wlan1_hostap.conf
killall hostapd
sleep 2
hostapd -B -P /var/run/wlan0_hostapd.pid /tmp/wlan0_hostap.conf
hostapd -B -P /var/run/wlan1_hostapd.pid /tmp/wlan1_hostap.conf
"disassoc_low_ack" Disconnects clients who don't respond
If you deactivate "disassoc_low_ack", the clients are not disconnected but it has a negative impact on network performance.
There is also a switch for this in the advanced WLAN settings.
and you can add "ap_max_inactivity" as "custom config" in the "wireless security" tab.
jtbr wrote:
3) Bugs noted:
3a) In debugging volatility, I noticed I seemed to be getting duplicate DHCP messages:
Code:
Oct 16 08:19:22 ddwrt_main daemon.info dnsmasq-dhcp[15811]: DHCPINFORM(br0) 192.168.9.238 14:f6:d8:83:b0:d5
Oct 16 08:19:22 ddwrt_main daemon.info dnsmasq-dhcp[15811]: DHCPACK(br0) 192.168.9.238 14:f6:d8:83:b0:d5 L201582949
Oct 16 08:19:22 ddwrt_main daemon.info dnsmasq-dhcp[15811]: DHCPINFORM(br0) 192.168.9.238 14:f6:d8:83:b0:d5
Oct 16 08:19:22 ddwrt_main daemon.info dnsmasq-dhcp[15811]: DHCPACK(br0) 192.168.9.238 14:f6:d8:83:b0:d5 L201582949
It appears this is a bug where DD-WRT is putting bridges into the first line of the dnsmasq config file TWICE, so DHCP is listening, and responding, twice:
Code:
interface=br0,br1,br2,br3,br1,br2,br3
This has been the case for years and has had no negative effects.
You can also edit the dnsmasq.conf and restart dnsmasq manually - I doubt there will be a difference.
Posted: Tue Oct 17, 2023 14:34 Post subject: Re: Network volatility, network spikes
ho1Aetoo wrote:
"disassoc_low_ack" Disconnects clients who don't respond
If you deactivate "disassoc_low_ack", the clients are not disconnected but it has a negative impact on network performance.
There is also a switch for this in the advanced WLAN settings.
and you can add "ap_max_inactivity" as "custom config" in the "wireless security" tab.
Didn't realize these could be set by GUI, thanks. In my case, it's a home with the same devices, some might come and go or move between waps, but we're not really adding or removing devices so it seemed fine to me. If you were a coffee shop or something, perhaps it would be problematic.
ho1Aetoo wrote:
Code:
interface=br0,br1,br2,br3,br1,br2,br3
This has been the case for years and has had no negative effects.
So, I've found that those duplicate DHCPREQUEST+DHCPACK messages have mostly gone away since I removed the duplicate bridge listings. Not a big reduction in network traffic, and probably (?) doesn't mess up the clients, so not a huge deal, but still. It seems that dnsmasq is listening and responding twice per bridge.
Any ideas for what might be behind the network spikes?
I only tested this with my VAPs because the interfaces for the VAPs are also entered twice.
I see absolutely no difference at all between 1x wlan0.1 and 2x wlan0.1 and 3x wlan0.1
I also assume that dnsmasq is intelligent enough and ignores duplicate entries.
Alozaros wrote:
The disconnecting client problem with the R7800 is nothing new... in my case those messages show up in the logs too, but my clients are actually fine; they don't lose connections or have connectivity issues.
In my case, it was ALL clients on that radio, and then they could not reconnect until I restarted hostapd or the router.
Alozaros wrote:
do you use Vanilla or DDWRT drivers ?
To help you more we need to know all your wifi settings...
I've always used Vanilla drivers. I'll post my wifi settings in another email, but I think all my other issues are independent of wifi.
Alozaros wrote:
I tend to set the Lease Expiration time to 360, or in a busy environment even lower: 240, 120 or even 60. Imagine a train station or bus station where tons of new clients reserve every address in the DHCP table, so any newer clients won't be given IPs because they are all booked already.
So, that's interesting. DHCP lease of 360? I've actually set mine higher than default: 2880m, because I prefer they not change much and it should reduce network traffic a bit. Also, I have 4 subnets/VAPs and none of them get close to full, so no risk of running out of IPs. Is there an argument that making them renew more frequently is somehow more stable or works well with a wider array of clients?
Alozaros wrote:
But be wise and do not remove interfaces that do exist, as you may get into trouble.
nvram show | grep hwaddr - to check..
nvram unset br2_hwaddr - to remove..
nvram commit - to commit changes..
reboot - to reboot the system..
So that's also interesting. Except for those times when I couldn't get hostapd to restart, 'brctl show' normally looks fine. But when I look at nvram, there are some entries that don't belong and must have been created in the process of modifying the gui settings and hitting apply:
Only br0/1/2/3 and br0.11,br1.12,br2.13,br3.14 should be there. I'll remove the extras and see what happens. Also interesting that they all have the same MAC address? I guess that's ok/normal.
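A sketch of how the leftover entries could be listed automatically. It assumes that nvram show prints name=value lines and that bridge entries are named like br2_hwaddr, as in the commands quoted above. How the tagged bridges (br0.11 and so on) are named in nvram is a guess, so check the real output first. The helper name unexpected_hwaddr is made up.

```shell
# Hypothetical helper: read "nvram show" output on stdin and print every
# brX_hwaddr entry whose bridge name is not in the expected list.
# Usage: nvram show | unexpected_hwaddr br0 br1 br2 br3
unexpected_hwaddr() {
    expected=" $* "
    grep '^br[0-9.]*_hwaddr=' | while IFS='=' read -r name value; do
        iface=${name%_hwaddr}
        case "$expected" in
            *" $iface "*) ;;                        # expected, stay quiet
            *) printf '%s %s\n' "$iface" "$value" ;;
        esac
    done
}
```

It only prints candidates and unsets nothing, which fits the warning above about not removing interfaces that really exist.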
Alozaros wrote:
irqbalance is fixed and available on builds after 10-14-2023-r53633. I'm on 53662 and can confirm irqbalance is present. A new build will follow soon, as those builds are on k6.1 and there will be lots of WIP.
Apart from that, you can always set CPU affinity manually; just add those commands and save them to startup.
Yeah, I saw the big upgrades and thought I should wait until the dust settles.
Although I suspect I could just use the irqbalance executable and it would just work if I copied it on. Any chance you could share that from a newer build?
Otherwise, any tips for how to set the affinity? It would be nice if somehow one processor were always dedicated to dealing with network traffic while the other could do whatever else the router was running, so network traffic never waited for other procs. Probably not possible though I guess.
I only tested this with my VAPs because the interfaces for the VAPs are also entered twice.
I see absolutely no difference at all between 1x wlan0.1 and 2x wlan0.1 and 3x wlan0.1
I also assume that dnsmasq is intelligent enough and ignores duplicate entries.
Edit: just tested with 3x br0 = absolutely no difference
Try leaving it running with multiple of those bridges, and let me know what you see. I've only been running without dupes in the interface line for 2 days, so maybe it's spurious. But the night before I got several dozen duplicate DHCPREQUEST+DHCPACK pairs in the same second, now only a few.
But note that I never seemed to see duplicates associated with DHCP discovery (DHCPDISCOVER/DHCPOFFER), but only with DHCP renewal.
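To make that comparison less anecdotal, the duplicates can be counted straight from the syslog. A minimal sketch, assuming the log format shown earlier in the thread, where a duplicate is an identical dnsmasq-dhcp line logged in the same second. The helper name is made up, and where the syslog lives depends on the setup.

```shell
# Hypothetical helper: read syslog text on stdin and print each dnsmasq-dhcp
# line that occurs more than once, prefixed with its count.
dup_dhcp_lines() {
    grep 'dnsmasq-dhcp' | sort | uniq -c | awk '$1 > 1'
}

# Example (log path is an assumption):
#   dup_dhcp_lines < /var/log/messages | wc -l
```

A client that genuinely sends the same request twice within one second would also show up here, so a nonzero count alone does not prove dnsmasq answered twice.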
Make sure the router is using the Performance governor.
You can manually balance IRQs.
See the R7800 install guide linked in my signature; the "Performance" section is at the end.
SFE can introduce some lag when switching. I got the best results with it disabled, but YMMV.
I have not seen the problems you describe. Maybe you created some loop which STP needs to counter? Just a thought.
Thanks, yes I think I learned about the processor affinity from your very helpful guide. I also tried to set wifi via the best wireless settings. I'll post mine next, but I've now disabled wifi and am now using a WAP because of the frequent disconnect problem.
I have confirmed, I am using the performance governor. Shortcut Forwarding Engine is disabled.
Regarding load balance, I see that although the eth0 IRQ is set to 2 and eth1 IRQ is set to 3 (which I believe means either CPU 1 or 2), all the eth1 IRQs go to CPU2.
If I want to ensure network traffic is never waiting for user processes on the router, perhaps I could try setting both eth0 and eth1 to CPU 1 and all rescheduling, function call, and USB-related interrupts to CPU 2 -- so that one processor is always dedicated to network traffic? The other big user of interrupts is adm_dma. I assume dma means "direct memory access" but not sure what adm stands for. Could that be moved to CPU2 as well? Lots I don't understand the subtleties of here so might be best to leave alone.
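For reference, manual pinning on Linux goes through /proc/irq/N/smp_affinity, which takes a hexadecimal CPU bitmask: 1 is the first CPU, 2 the second, 3 either of them. The sketch below uses made-up helper names and assumes the interrupt shows up in /proc/interrupts with the interface name (for example eth0) in the last column, which may not hold on this board. Some interrupts cannot be moved at all, in which case the write simply fails.

```shell
# Hex affinity mask that selects a single CPU (0-based index).
cpu_mask() {
    printf '%x\n' $(( 1 << $1 ))
}

# Read /proc/interrupts-style text on stdin and print the IRQ numbers
# whose last column equals NAME.
irqs_for() {
    awk -v name="$1" '$NF == name { sub(/:$/, "", $1); print $1 }'
}

# Pin every IRQ named NAME to one CPU. Needs root; writes to /proc.
# Usage: pin_irqs NAME CPU_INDEX
pin_irqs() {
    mask=$(cpu_mask "$2")
    for irq in $(irqs_for "$1" < /proc/interrupts); do
        echo "$mask" > "/proc/irq/$irq/smp_affinity"
    done
}

# Example: both NICs on the first CPU
#   pin_irqs eth0 0; pin_irqs eth1 0
```

Whether dedicating one core to the NICs actually helps here is untested; watching the per-CPU columns in /proc/interrupts before and after would show whether the change took.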