[Resolved] Network volatility, network spikes

jtbr
DD-WRT User


Joined: 09 Mar 2017
Posts: 100

Posted: Tue Oct 17, 2023 17:26
So here are my wireless settings from before I disabled wireless. (The 2.4 GHz radio's Network mode should say "N / G Mixed".) If anything jumps out at you, please let me know.

The 3 Virtual interfaces per band are all configured the same, except one for each radio has AP isolation enabled, the others don't.

In all cases, Security is WPA/WPA2 Personal/CCMP-128 (AES), key renewal 14400s, strict rekeying disabled, 802.11r (FT) disabled, 802.11w MFP disabled, Disable EAPOL Key Retries disabled. No custom config.

I understand now that I could put the hostapd setting ap_max_inactivity=86400 in the custom config field here and change the Disassoc Low Ack setting under Advanced.

(Note that I'm most interested in sorting out the network instability and network spikes, since I have a proprietary WAP as a backup for wifi issues, but it would be nice if I could depend upon wifi again!)
egc
DD-WRT Guru


Joined: 18 Mar 2014
Posts: 12922
Location: Netherlands

Posted: Tue Oct 17, 2023 17:40
From the best wireless settings thread (link already provided earlier):

Quote:
Newer builds have the option to switch between Vanilla and DD-WRT firmware for both 2.4GHz & 5GHz.
Vanilla is the original QCOM firmware from Kvalo, and DD-WRT is the DD-WRT custom firmware, based on CandelaTech's driver but heavily modified by the DD-WRT developers (BS and NBD).
DD-WRT firmware has some extra options, e.g. Half (10 MHz), Quarter (5 MHz) and auto ACK timing.
Most users prefer Vanilla

Airtime Fairness is reported to cause problems by some users, if you have stability problems disable Airtime Fairness.

U-APSD (Automatic Power Save) is reported to cause problems by some users, if you have stability problems disable U-APSD (Automatic Power Save).

Security
WPA2 Personal CCMP-128 (AES)

Other settings (e.g. WPA3) are not well tested and might not work reliably.


Unless you have many clients, also disable RTS/CTS.

Beacon/DTIM, use 200/1 or 100/2
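For what it's worth, the two suggested Beacon/DTIM pairs give the same DTIM spacing. The beacon interval is in 802.11 time units (1 TU = 1024 microseconds), and every DTIM-th beacon is a DTIM beacon, which is when power-saving clients wake for buffered broadcast/multicast traffic. A small sketch of the arithmetic; `dtim_interval_ms` is just a hypothetical helper name:

```shell
#!/bin/sh
# dtim_interval_ms BEACON_TU DTIM
# Prints the time between DTIM beacons in milliseconds.
# 802.11 beacon intervals are given in time units (TU); 1 TU = 1024 microseconds.
dtim_interval_ms() {
  awk -v b="$1" -v d="$2" 'BEGIN { printf "%.1f\n", b * d * 1.024 }'
}

dtim_interval_ms 200 1   # prints 204.8
dtim_interval_ms 100 2   # prints 204.8
```

The difference is that 100/2 sends twice as many beacons to reach the same DTIM spacing.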

_________________
Routers:Netgear R7000, R6400v1, R6400v2, EA6900 (XvortexCFE), E2000, E1200v1, WRT54GS v1.
Install guide R6400v2, R6700v3,XR300:https://forum.dd-wrt.com/phpBB2/viewtopic.php?t=316399
Install guide R7800/XR500: https://forum.dd-wrt.com/phpBB2/viewtopic.php?t=320614
Forum Guide Lines (important read):https://forum.dd-wrt.com/phpBB2/viewtopic.php?t=324087
jtbr
DD-WRT User


Joined: 09 Mar 2017
Posts: 100

Posted: Wed Oct 18, 2023 8:38    Post subject: Re: Network volatility, network spikes
jtbr wrote:

Now, without mdns/avahi, without upnprd, and without wifi, it's mostly better. But now with uptime nearing 1 day, about once an hour (~4000 seconds) I'm getting delays peaking around 12 seconds for about 15 pings. (In subsequent testing, these became less regular but continued). Otherwise it looks pretty reasonable (although I don't really get why it would ever approach or exceed 500ms, and it still does). I've been able to observe top when this spike happens but unfortunately top does not update, or the ssh disconnects (!) at that moment, so it was not informative.



It appears I have found the final culprit for very long ping responses (after [seemingly] avahi reflector+upnprd): smartdns.

I removed the line
prefetch-domain yes
(which causes smartdns to periodically refresh all cached IPs) from the smartdns configuration and started smartdns with 'nice -n 2'.
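The same change as a script, for anyone who starts smartdns by hand. This is a minimal sketch: CONF is a hypothetical path (point it at your actual smartdns configuration), the example file it creates when none exists is only for illustration, and `-c` is smartdns's option for the config file.

```shell
#!/bin/sh
# CONF is a hypothetical path; set it to wherever your smartdns config lives.
CONF="${CONF:-/tmp/smartdns-example.conf}"

# For illustration only: create a small example config if none exists.
if [ ! -f "$CONF" ]; then
  printf 'server 1.1.1.1\nprefetch-domain yes\ncache-size 512\n' > "$CONF"
fi

# Delete the prefetch-domain line (periodic refresh of all cached entries).
sed -i '/^[[:space:]]*prefetch-domain[[:space:]]/d' "$CONF"

# Restart smartdns at niceness 2, but only if it is installed.
if command -v smartdns >/dev/null 2>&1; then
  killall smartdns 2>/dev/null
  nice -n 2 smartdns -c "$CONF"
fi
```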

Now the ping-response delays are staying under 2 seconds (see image below). That is still not ideal (!) but much better, and perhaps enough to cure the video-calling quirks. How do ping times look for you? Do they stay under 500 ms, or do they occasionally climb toward 2 seconds like this?
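To compare runs like these without eyeballing charts, a ping log can be reduced to a few numbers: mean, standard deviation, maximum, and the count of slow replies. A minimal sketch, assuming iputils- or busybox-style ping output with a "time=12.3 ms" field; `ping_summary` is a hypothetical helper, and the 500 ms default is simply the threshold mentioned above.

```shell
#!/bin/sh
# ping_summary [LIMIT_MS]
# Reads ping output on stdin and prints the reply count, mean, population
# standard deviation, maximum, and how many replies exceeded LIMIT_MS (default 500).
ping_summary() {
  awk -v limit="${1:-500}" '
    {
      for (i = 1; i <= NF; i++) {
        if ($i ~ /^time=/) {          # e.g. "time=12.3" followed by "ms"
          t = substr($i, 6) + 0
          n++; sum += t; sumsq += t * t
          if (t > max) max = t
          if (t > limit) slow++
        }
      }
    }
    END {
      if (n == 0) { print "no replies"; exit 1 }
      mean = sum / n
      var = sumsq / n - mean * mean
      if (var < 0) var = 0           # guard against rounding error
      printf "replies=%d avg=%.1fms sd=%.1fms max=%.1fms over_%sms=%d\n",
             n, mean, sqrt(var), max, limit, slow
    }'
}
```

Usage would be something like `ping -c 3600 192.168.1.1 | ping_summary 500` for an hour of one-per-second pings (the address is a placeholder).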

Now I'm trying to see if the Multi-gigabit network spikes are still occurring. Perhaps not?!

Also, could someone please share the irqbalance executable from a new Atheros build where it's working?

Many thanks. It has been helpful for me just to present this, by forcing myself to be more rigorous and understand what I'm seeing.


Last edited by jtbr on Wed Oct 18, 2023 11:24; edited 1 time in total
jtbr
DD-WRT User


Joined: 09 Mar 2017
Posts: 100

Posted: Wed Oct 18, 2023 8:47
ho1Aetoo wrote:

I also assume that dnsmasq is intelligent enough and ignores duplicate entries.

Edit: just tested with 3x br0 = absolutely no difference


After running overnight again without duplicate bridges in the dnsmasq interface line, I get the same results: lots of duplicated DHCP renewals within the same second. Notably, they are being requested well before the lease expires (a lot of them come from one Windows machine; I assume this is a client issue and won't worry about it).
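To put a number on the duplicate renewals, the syslog can be scanned for several DHCPACKs to the same MAC address within the same second. A minimal sketch, assuming dnsmasq's DHCP messages reach syslog in the usual "Mon DD HH:MM:SS host tag: DHCPACK(br0) IP MAC name" form (e.g. in /var/log/messages when syslogd is enabled); `dhcp_dupes` is a hypothetical helper name.

```shell
#!/bin/sh
# dhcp_dupes: read syslog lines on stdin and report MAC addresses that received
# more than one DHCPACK within the same second.
# Assumes the "Mon DD HH:MM:SS host tag: DHCPACK(br0) IP MAC [name]" layout.
dhcp_dupes() {
  awk '
    {
      for (i = 1; i <= NF; i++) {
        if ($i ~ /^DHCPACK\(/ && i + 2 <= NF) {
          # $1-$3 are the timestamp; the MAC follows the IP address.
          count[$1 " " $2 " " $3 " " $(i + 2)]++
        }
      }
    }
    END {
      for (k in count) if (count[k] > 1) print count[k], k
    }'
}
```

For example, `dhcp_dupes < /var/log/messages | sort -rn | head` would list the worst offenders first.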

So you're right, the duplicate bridge entries don't seem to make a difference. (Also, it's quite hard to keep them out in any case, as the dnsmasq config file seems to be rewritten frequently.) I still haven't tried lower lease times, but I will if you think it's worthwhile.
egc
DD-WRT Guru


Joined: 18 Mar 2014
Posts: 12922
Location: Netherlands

Posted: Wed Oct 18, 2023 9:46    Post subject: Re: Network volatility, network spikes
jtbr wrote:


It appears I have found the final culprit for very long ping responses: smartdns.

I removed the line
prefetch-domain yes
(which causes smartdns to periodically refresh all cached IPs) from the smartdns configuration and started smartdns with 'nice -n 2'.

So now the ping-response delays are staying under 2 seconds. (Still not ideal (!) but much better)


Interesting find; the guide mentions it:
Quote:
Prefetch Domain:
When Enabled SmartDNS will be pre-fetching domain names to improve query hit rate.
This feature consumes more CPU, so do not use it on low-end routers.
Serve Expired:
When Enabled it will improve the cache hit rate and reduce the CPU consumption.


But I did not know it has such an impact. Maybe the lack of irqbalance is also part of the problem?
irqbalance is running again on recent builds, but those builds have switched to kernel 6.1, which is still a work in progress.

jtbr
DD-WRT User


Joined: 09 Mar 2017
Posts: 100

Posted: Wed Oct 18, 2023 11:41
As for wireless settings:

egc wrote:
Airtime Fairness is reported to cause problems by some users, if you have stability problems disable Airtime Fairness.

Unless you have many clients also disable RTS/CTS

Beacon/DTIM, use 200/1 or 100/2


It looks like these are the three things that could be changed. I have about 50 clients, so RTS/CTS seems useful, no?

When I look at these wifi threads, I wonder which of these is being said: "These work optimally in my setting, and I've experimented a lot", "These work optimally, period. I know the standards and the implementations and have learned from experience in a wide variety of settings", or "Any deviation from these is potentially destabilizing".

So for the beacon, I'd be quite surprised if that setting fell into the third category (RTS/CTS too).
egc
DD-WRT Guru


Joined: 18 Mar 2014
Posts: 12922
Location: Netherlands

Posted: Wed Oct 18, 2023 11:58
With so many clients RTS/CTS would certainly be useful.

A beacon interval of 300/1 or 400/1 should also not pose a problem.

It is just that you want to rule out anything so just use defaults.

jtbr
DD-WRT User


Joined: 09 Mar 2017
Posts: 100

Posted: Wed Oct 18, 2023 14:39
egc wrote:
With so many clients RTS/CTS would certainly be useful.

A beacon interval of 300/1 or 400/1 should also not pose a problem.

It is just that you want to rule out anything so just use defaults.


Got it. Thanks. When I get a chance I will try disabling airtime fairness. But it'll probably not be for a while.


Unfortunately, while I'm increasingly confident that the instability in response times is mostly solved, the momentary network spikes of over 20 Gbps do persist. These are most mysterious, as I can't confirm them with any other tool and I'm not even sure they have any impact (though such a peak, if real, should probably bring the router to its knees). Maybe they're a bug in the bandwidth monitoring?

Earlier I think you mentioned perhaps there could be an issue with STP. My setup is this:

Code:
Bridge Name   STP   Interface
br0           no    eth1 vlan1 vlan11
br1           yes   vlan12
br2           yes   vlan14
br3           yes   vlan13


Does this look ok? br0 has STP Off. The others have "STP" selected. Other options are RSTP and MSTP. I have no idea what they do. Under Assign to bridge, the non-br0 vlan interfaces (vlan12/13/14) have STP "On".
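To see what the kernel actually has for each bridge, rather than what the GUI shows, the STP state is exposed in sysfs as /sys/class/net/BRIDGE/bridge/stp_state (0 = off, 1 = kernel STP, 2 = user-space STP). A minimal sketch; `bridge_stp_report` is a hypothetical helper, and its optional argument exists only so it can be pointed at a test directory.

```shell
#!/bin/sh
# bridge_stp_report [SYSFS_NET_DIR]
# Prints the STP mode of every bridge found under /sys/class/net (or the given dir).
# To turn STP off on one bridge at runtime (not persistent): brctl stp br1 off
bridge_stp_report() {
  root="${1:-/sys/class/net}"
  for f in "$root"/*/bridge/stp_state; do
    [ -e "$f" ] || continue            # no bridges: the glob did not match
    br=$(basename "$(dirname "$(dirname "$f")")")
    case "$(cat "$f")" in
      0) state="off" ;;
      1) state="kernel STP" ;;
      2) state="user-space STP" ;;
      *) state="unknown" ;;
    esac
    echo "$br: $state"
  done
}
```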
ho1Aetoo
DD-WRT Guru


Joined: 19 Feb 2019
Posts: 3006
Location: Germany

Posted: Wed Oct 18, 2023 15:53
Well, I didn't really want to mention it.
But I have also seen these spikes; they can happen if the WebIF / bandwidth monitor hangs for a short time.

In principle, you should not let the WebIF run for hours in the background, that only consumes resources.

Also, the WebIF should run in an isolated window / browser.

I had also some strange problems, like the WAN port went down and the PPPoE connection reestablished itself (when the WebIF was running in the background for several hours with many other tabs).

And you can disable STP.
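One way a short hang could turn into an impossible reading, assuming (this is not confirmed from the DD-WRT source) that the graph divides each byte-counter delta by its nominal sampling interval rather than by the time that actually elapsed: with illustrative numbers, 30 seconds of traffic at 700 Mbit/s charged to a single 1-second sample shows up as 21 Gbit/s. The sketch reads the receive counter from /proc/net/dev and computes the rate over a given interval; `rx_bytes` and `rate_mbps` are hypothetical helper names.

```shell
#!/bin/sh
# rx_bytes IFACE [FILE]
# Prints the receive-byte counter for IFACE from /proc/net/dev (or FILE).
rx_bytes() {
  awk -v ifc="$1" '{
    sub(/^[ \t]+/, "")            # drop leading padding
    split($0, a, ":")             # "eth0: 123 ..." -> name, counters
    if (a[1] == ifc) { split(a[2], f, " "); print f[1] }
  }' "${2:-/proc/net/dev}"
}

# rate_mbps BYTES SECONDS
# Average throughput in Mbit/s over the interval.
rate_mbps() {
  awk -v b="$1" -v s="$2" 'BEGIN { printf "%.1f\n", b * 8 / s / 1000000 }'
}

# Illustrative numbers: 30 s of traffic at 700 Mbit/s.
rate_mbps 2625000000 30   # prints 700.0 (divided by the real elapsed time)
rate_mbps 2625000000 1    # prints 21000.0 (same bytes charged to one 1 s sample)
```

On the router, a measurement that uses the real elapsed time would look like `t0=$(date +%s); b0=$(rx_bytes eth0); sleep 10; t1=$(date +%s); b1=$(rx_bytes eth0); rate_mbps $((b1 - b0)) $((t1 - t0))`.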

_________________
Quickstart guides:
use Pi-Hole as simple DNS-Server with DD-WRT
VLAN configuration via GUI - 1 CPU port
VLAN configuration via GUI - 2 CPU ports (R7800, EA8500 etc)

Routers
Marvell OCTEON TX2 - QHora-322 - OpenWrt 23.05.3 - Gateway
Qualcomm IPQ8065 - R7800 - DD-WRT - WAP
jtbr
DD-WRT User


Joined: 09 Mar 2017
Posts: 100

Posted: Wed Oct 18, 2023 19:05
ho1Aetoo wrote:
Well, I didn't really want to mention it.
But I have also seen these spikes; they can happen if the WebIF / bandwidth monitor hangs for a short time.

In principle, you should not let the WebIF run for hours in the background, that only consumes resources.

Also, the WebIF should run in an isolated window / browser.

I had also some strange problems, like the WAN port went down and the PPPoE connection reestablished itself (when the WebIF was running in the background for several hours with many other tabs).

And you can disable STP.


Well, thanks. That makes sense, and it also makes me feel a bit better. At least I'm not crazy! More importantly, it's nothing to worry about. Incidentally, running the bandwidth monitor appears to be part of what is causing the remaining unresponsiveness.

Regarding STP, it's not necessary even though 3 of my 4 subnets/bridges have routes to each other?
jtbr
DD-WRT User


Joined: 09 Mar 2017
Posts: 100

Posted: Thu Oct 19, 2023 13:35
Well, I have another finding to report. I've had very good success with modifying the CPU-core affinities. Both the mean and the standard deviation of ping response times went down significantly after customizing them. I'm attaching charts from before and after changing the affinities (note that their scale differs from the prior charts). If anything, the load on the router was higher while the custom affinities were set, since the test with default affinities ran mostly overnight.

To achieve this I did:

Code:
# smp_affinity takes a hex CPU bitmask: 1 = cpu0, 2 = cpu1, 3 = cpu0 and cpu1
# (the IRQ numbers below are from this router; look yours up in /proc/interrupts)
# process eth0 (wan) on cpu1 (same as eth1) -- probably the only thing that really matters
echo 2 > /proc/irq/100/smp_affinity
# process serial and usb on cpu0
echo 1 > /proc/irq/170/smp_affinity
echo 1 > /proc/irq/172/smp_affinity

# run openvpn, smartdns, and mdns-reflector on cpu0 using commands like:
taskset -c 0 [command and params to run]


I start the commands myself, so it's easy to start them with the cpu affinity, but it's also possible to change the affinity of a running process:

Code:
taskset -cp 0 [pid]


A script could be made to do this automatically. 'top' will tell you which cpu is running each command.
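A minimal sketch of such a script, assuming processes are matched by the kernel's process name in /proc/PID/comm (truncated to 15 characters) and that taskset is available as in the commands above. `pin_to_cpu` and `DRY_RUN` are hypothetical names. Changing the affinity of a running process is not without risk, so DRY_RUN=1 prints the commands instead of running them.

```shell
#!/bin/sh
# pin_to_cpu CPU NAME...
# Moves every running process whose /proc/PID/comm matches one of NAME...
# to CPU using taskset -cp. With DRY_RUN=1 it only prints the commands.
pin_to_cpu() {
  cpu="$1"; shift
  for name in "$@"; do
    for comm in /proc/[0-9]*/comm; do
      # A process may exit while we scan; skip anything we cannot read.
      cur=$(cat "$comm" 2>/dev/null) || continue
      [ "$cur" = "$name" ] || continue
      pid=${comm#/proc/}; pid=${pid%/comm}
      if [ -n "$DRY_RUN" ]; then
        echo "taskset -cp $cpu $pid"
      else
        taskset -cp "$cpu" "$pid" >/dev/null
      fi
    done
  done
}

# Example with the process names from the post above:
# pin_to_cpu 0 openvpn smartdns mdns-reflector
```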

The web UI was consciously not being used during these tests.

Note also that these results were obtained while also running mdns-reflector (with niceness 5), which seems to cause fewer problems than avahi-reflector or mdns-repeater, the latter of which I've used for years. I'm also attaching a build of mdns-reflector for Atheros, if anyone else wants to use it.
egc
DD-WRT Guru


Joined: 18 Mar 2014
Posts: 12922
Location: Netherlands

Posted: Fri Oct 20, 2023 6:36
In current (K6.1) builds irqbalance should work again.

Be careful with changing the affinity of running processes; it might provoke a crash.

On another note, I trust you are running the performance governor?
You can run ondemand, but then you have to tweak its settings.
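To confirm which governor is actually active, the kernel's cpufreq interface exposes it per core in sysfs. A minimal sketch, assuming the standard /sys/devices/system/cpu/cpuN/cpufreq/scaling_governor files are present on this build; `show_governors` is a hypothetical helper name, and its optional argument is only there for testing.

```shell
#!/bin/sh
# show_governors [CPU_SYSFS_DIR]
# Prints the active cpufreq governor for each CPU.
# Writing a name (e.g. performance) into a scaling_governor file changes it at
# runtime only; whatever the firmware applies at boot may set it again.
show_governors() {
  root="${1:-/sys/devices/system/cpu}"
  for g in "$root"/cpu[0-9]*/cpufreq/scaling_governor; do
    [ -r "$g" ] || continue
    cpu=$(basename "$(dirname "$(dirname "$g")")")
    echo "$cpu: $(cat "$g")"
  done
}
```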

jtbr
DD-WRT User


Joined: 09 Mar 2017
Posts: 100

Posted: Tue Oct 24, 2023 10:22
So, unfortunately, my experiment with CPU affinities is not a success after all. While it performs better in the general case, when clients have a lot of open connections (thousands), it brings the router to its knees. It simply can't keep up: dozens of check_ps processes run simultaneously, taking up CPU, and network traffic effectively stops. I suppose in this case one CPU core simply cannot handle the network load on its own. It appears to be the number of connections that matters, not the amount of data being sent. For stability I've had to revert to the default CPU affinities.
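Since it seems to be the number of connections rather than the traffic volume that hurts, it may help to watch the connection-tracking table while this happens. A minimal sketch, assuming the kernel exposes the usual nf_conntrack_count and nf_conntrack_max files under /proc/sys/net/netfilter; `conntrack_usage` is a hypothetical helper name.

```shell
#!/bin/sh
# conntrack_usage [DIR]
# Prints the number of tracked connections, the table limit, and the fill level.
conntrack_usage() {
  dir="${1:-/proc/sys/net/netfilter}"
  count=$(cat "$dir/nf_conntrack_count" 2>/dev/null) || return 1
  max=$(cat "$dir/nf_conntrack_max" 2>/dev/null) || return 1
  [ "$max" -gt 0 ] || return 1
  echo "conntrack: $count / $max ($((count * 100 / max))%)"
}
```

During a heavy-load episode, something like `while true; do conntrack_usage; ps | grep -c '[c]heck_ps'; sleep 5; done` would show both the table size and the number of check_ps processes over time.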

I would be happy to try irqbalance to see how that works. If someone could send the irqbalance binary from a newer build (where it is working) I can try it. But a full upgrade is not in the cards for me at this time and anyway it would not be a fair experiment as too much else has changed.

@egc I am using the performance governor at the moment. What else would I have to change if I wanted to use the ondemand governor?
egc
DD-WRT Guru


Joined: 18 Mar 2014
Posts: 12922
Location: Netherlands

Posted: Tue Oct 24, 2023 10:34
ondemand could even be slower, so just keep the performance governor.

I attached a newer irqbalance, but it is compiled for kernel 6.1, so there is a good chance it will not work.

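You can see if it works by running it in debug mode (it stays in the foreground and prints what it is doing):
Code:
irqbalance -d

Then run it normally with a 10-second rebalancing interval:
Code:
irqbalance -t 10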
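To check whether irqbalance is actually moving interrupts around, the affinity mask of each IRQ can be dumped before and after starting it. A minimal sketch reading /proc/irq/N/smp_affinity; `irq_affinities` is a hypothetical helper name, and the optional argument is only there for testing.

```shell
#!/bin/sh
# irq_affinities [IRQ_DIR]
# Prints "IRQ MASK" for every numbered IRQ under /proc/irq, sorted by IRQ.
irq_affinities() {
  root="${1:-/proc/irq}"
  for f in "$root"/[0-9]*/smp_affinity; do
    [ -r "$f" ] || continue
    irq=$(basename "$(dirname "$f")")
    echo "$irq $(cat "$f")"
  done | sort -n
}
```

Saving the output before and after (e.g. `irq_affinities > /tmp/before`) and comparing the two files shows which IRQs irqbalance has moved.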

The R7800 guide has some more performance tweaks; have a look there, but I am not very hopeful.

jtbr
DD-WRT User


Joined: 09 Mar 2017
Posts: 100

Posted: Wed Oct 25, 2023 8:53
Many thanks egc. irqbalance is working.

It appears to perform the best. Average latency is low, lower than with manually set affinities (which already beat the defaults), though with slightly more variation (a few higher delays). It also handles high connection loads without issue, unlike putting eth0 and eth1 together on a single core (cpu1). Interestingly, irqbalance usually seems to put them both on cpu1 as well, which I would expect to cause the same problem, but it doesn't. I suspect irqbalance is able to influence the software interrupts, while we were only adjusting the hardware interrupts. I also can't discern any effect of the affinity options on bridge-to-bridge transfer speeds; those are still inconsistent on my router, and I can't explain why they change from test to test, or for that matter within a test. In any case, I'll stick with this.
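One way to probe the software-interrupt suspicion is to look at where network receive softirqs are actually being handled: /proc/softirqs keeps a per-CPU NET_RX counter. A minimal sketch (`net_rx_per_cpu` is a hypothetical helper name); comparing two readings taken a few seconds apart under load shows which core does the receive work, though not how irqbalance arranged it.

```shell
#!/bin/sh
# net_rx_per_cpu [FILE]
# Prints "CPUn COUNT" for the NET_RX softirq from /proc/softirqs (or FILE).
net_rx_per_cpu() {
  awk 'NR == 1 { for (i = 1; i <= NF; i++) cpu[i] = $i; next }
       $1 == "NET_RX:" { for (i = 2; i <= NF; i++) print cpu[i - 1], $i }' \
    "${1:-/proc/softirqs}"
}
```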
All times are GMT