Nighthawk X8 R8500 AC5300 router available $400 msrp.

phoenix127
DD-WRT User


Joined: 02 Jan 2011
Posts: 80
Location: UK

Posted: Thu Aug 18, 2016 13:58    Post subject: Read after write and temperature
The concept of a read-after-write verify is not new; tape drives have done that for ages (sometimes with two heads, one read, one write). Reads don't cost anything in device lifespan either. I don't know if dd-wrt does this already. It might help in some situations, but I don't think it's going to fix this issue.

My current gut feel, having spent some time trawling around the interweb reading reviews (that I should probably have read before purchase if I knew then, what I know now) is that this is a thermal issue.

Netgear support and Amazon reviews all say the same sort of thing (and highlight the 90 days of "free phone support"; apparently it's paid after that, which is far less than ideal and not what I would expect on a top-end device). It seems that some have had warranty replacements with some success, others not. This implies that there is a hardware issue here.

The trend seems to be high temperature, random reboots, performance slowing over time, etc. Remember that the Broadcom CPUs, like most modern ones, have thermal protection, so they slow down if they get too hot (hence the slower performance observed by others).

I did a quick test on my device with an infrared thermometer. It's currently 25 degrees in my house and 27 in my office. The router measures 51 degrees on the top and 52 degrees underneath on the label. Given that this is the outside of the poorly conducting plastic and not the heat sink, the heat sink is going to be hotter, and the devices under the heat sink hotter still.

From memory, the maximum "junction temperature" inside most chips is 125 degrees C. A quick check against the scant information available on Broadcom parts confirmed that (on a Russian site), at least for an earlier generation CPU: http://caxapa.ru/thumbs/281907/MPR_11-22-10.pdf

I really need to crack this puppy open and see what the heat sink says ...

Apparently the R8500 has a Spansion S34ML01G200TF100 NAND device in it, Spansion has been purchased by Cypress (see http://www.cypress.com/spansion-redirect) and the data sheet for the FLASH chip is here http://www.cypress.com/file/218306/download. The device has a thermal envelope of -40 to +85 degrees C.

I understand from other Broadcom based devices that the NVRAM is partition mtd/3, carved out of this chip; CFE, firmware, NVRAM and the built-in JFFS are all on the same chip.

The data sheet also states a 100,000 *typical* write cycle endurance and 4 bit ECC per 528 bytes (x8) or 264 words (x16), but clearly the firmware would have to check for ECC errors; again, I don't know if it does that.
DaveTheNerd
DD-WRT User


Joined: 15 Jul 2008
Posts: 317

Posted: Thu Aug 18, 2016 14:43    Post subject: Re: Read after write and temperature
Thanks again for this discussion, by the way. It's a great relief to have someone else perplexed with these seemingly odd issues, and it's satisfying to be able to compare notes.

phoenix127 wrote:
My current gut feel, having spent some time trawling around the interweb reading reviews (that I should probably have read before purchase if I knew then, what I know now) is that this is a thermal issue.


The more I think about this, you're probably right. I think ttraff writes once per day, regardless (right?), so it would not write any *more* frequently if we were home and using the router.

That said, there's definitely an issue with NVRAM corruption. Many times I've shot a backup, told the router to reboot, and had it boot-loop. When I've done a reset and tried to restore the backup to the very same firmware that just made it ten minutes earlier, boot loops start again.

So it's not JUST heat... but heat may be causing the NVRAM corruption, too, I suppose.
phoenix127
DD-WRT User


Joined: 02 Jan 2011
Posts: 80
Location: UK

Posted: Thu Aug 18, 2016 16:18    Post subject: Disassembly and temperatures
OK, so I took the plunge and decided to see how hot it was getting internally. I'm going to split this across several posts as there is a lot of info.

Disassembly
1. Remove any USB devices, since the plastic cover can't be removed with them in.
2. With the unit upside down and the Ethernet sockets facing away from you, remove the 4 plastic feet to expose the T9 Torx screws. Remove the 6 Torx screws in the base.
3. Gently slide the bottom cover backwards towards the Ethernet sockets; it moves about 10mm.
4. Lift off the base to get to the PCB.

There are no screws holding anything in, so things are now loose.

Note that the board can't be lifted out with all the aerials still attached, and if you tip the PCB the power cable will come out too (more on this later). The aerial cables don't have much slack because of all the tape used in assembly, so take care not to damage anything.

If you pop the PCB out, the front-left aerial next to the USB socket is horizontal and mounted on a bracket, so it needs to be nudged to get it to pop back into its hole.

Assembly is the reverse of above, just make sure that the PCB is on the locating lugs in the plastic before you drop the cover back on and slide it into position.

If you find that the plastic flap on the USB sockets is annoying you, it can be easily removed. It's got C shaped clips that are at 90 degrees to the face, so if you open the door, you can pop it off gently by pulling it horizontally away from the unit, like pulling the USB devices out. The spring clip is held on with fixings, so it doesn't come loose. The flap clips back in the same way. That's one annoyance fixed !

As to temperature, I left the unit powered on whilst I popped it apart, and this was done <30 minutes after the previous measurements (51 degrees on the top, 52 degrees on the bottom). Measuring the bottom of the board immediately after the cover was removed showed 60 degrees, and the heat sink measured 62 degrees. These soon dropped to 52 degrees for the bottom of the board and 53 degrees for the heat sink. I didn't check to see how the heat sink is affixed (pad to chips, or just to the metal RF shielding frames) as that would have required a full tear down, which I didn't want to do yet.

I conclude from this that it's likely to be within the -40 to 85 range, but it's not particularly hot here. If it was, say, 40 degrees outside rather than 25, the ambient temperature would be 15 degrees higher. Ignoring the fact that cooling is less effective when it's hotter, the heat sink would presumably be at least 77 (62+15) degrees, which is very close to the maximum operating temperature of the chip.
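
A rough sketch of that extrapolation, for anyone wanting to plug in their own readings. The function names are made up for illustration, and the one-for-one rise with ambient is the same simplifying assumption as in the paragraph above, so treat the result as a lower bound rather than a prediction:

```shell
#!/bin/sh
# Linear extrapolation only: assumes an internal temperature rises
# one-for-one with ambient, which is optimistic (cooling gets worse when hot).
# Arguments: measured temp, ambient at measurement, hypothetical ambient (deg C).
extrapolate_temp() {
    echo $(( $1 + ($3 - $2) ))
}

# Headroom between an estimated temperature and a rated maximum (deg C).
headroom() {
    echo $(( $2 - $1 ))
}

hot=$(extrapolate_temp 62 25 40)
echo "estimated heat sink temperature: ${hot} C"
echo "headroom to the 85 C NAND limit: $(headroom "$hot" 85) C"
```

With the readings from this post (62 degrees at 25 ambient, extrapolated to 40 ambient) it gives 77 degrees, which leaves 8 degrees before the 85 degree rating of the flash chip.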

As the unit cooled quickly with the covers off, perhaps the ventilation slots in the case are not quite up to it. There might be a market for after market add-on fan coolers for these units ! (feed-through barrel plug to the power socket, stick on fan to top ?)

Whilst I had the unit apart, I also investigated the 4 pin header. As expected it is a 3.3V serial port, and "R T G V" does indeed map to:
R = RX Data -> PC TX Data
T = TX Data -> PC RX Data
G = Ground -> Ground
V = Vcc (3.3V) -> do not connect
It's the normal 115,200 baud serial port.

This turned out to be handy. I knocked out the power cable when manipulating the PCB, and after reinserting it and doing the temperature measurements I noticed that the router was in a reboot loop. OK, fair cop, it could have been me manhandling the router in bits, or pulling out USB, or similar ... but no, it was toast again. More analysis to follow; I've got some logs to read.
phoenix127
DD-WRT User


Joined: 02 Jan 2011
Posts: 80
Location: UK

Posted: Thu Aug 18, 2016 18:14    Post subject: Tracking down the cause of the stuck-in-the-reboot loop
As noted in the previous post, I had another corruption whilst doing the previous analysis. Luckily I had the serial port active and had PuTTY logging for some of it (I restarted and forgot to re-enable logging .. damn !)

The effect was identical to the previous occasions: everything working, then without warning, bam, no boot, stuck in the reboot loop.

Rather than post the entire log here, I've pulled out what seems to be the relevant part. If someone like Kong or Brainslayer wants the full logs with the full stack trace, then I can PM it to them.

eth0: Broadcom BCM47XX 10/100/1000 Mbps Ethernet Controller 7.14.89.21 (r524987)
roboswitch: Probing device 'eth0'
roboswitch: trying a 53012! at eth0
detected CPU Port is 8
roboswitch: found a 53012! at eth0
detected CPU Port is 8
PCI: Enabling device 0001:01:00.0 (0140 -> 0142)
random: nonblocking pool is initialized
PCI: Enabling device 0002:03:00.0 (0140 -> 0142)
PCI: Enabling device 0002:04:00.0 (0140 -> 0142)
wl_module_init: passivemode set to 0x0
wl_module_init: txworkq set to 0x0
Unable to handle kernel NULL pointer dereference at virtual address 00000000
pgd = 84db8000
[00000000] *pgd=04db0831, *pte=00000000, *ppte=00000000
Internal error: Oops - BUG: 17 [#1] SMP ARM
Modules linked in: wl(P+) dhd igs(P) emf(P) switch_robo switch_core et(P)

CPU: 1 PID: 690 Comm: insmod Tainted: P 4.4.17 #1270
Hardware name: Northstar Prototype
task: 879dbc00 ti: 84da2000 task.ti: 84da2000
PC is at osl_pcie_rreg+0x5c/0x78
LR is at si_pmu_get_pmutimer+0x78/0x15c
pc : [<80211e50>] lr : [<80202c74>] psr: 80000013
sp : 84da39a0 ip : 84da39b0 fp : 84da39ac
r10: 00003a98 r9 : 89f08000 r8 : 89f08000
r7 : 86556380 r6 : 84da3a1c r5 : 86556380 r4 : 873b2400
r3 : 00000003 r2 : 84da39b4 r1 : 00000000 r0 : 86556380
Flags: Nzcv IRQs on FIQs on Mode SVC_32 ISA ARM Segment user
Control: 10c5387d Table: 04db804a DAC: 00000055
Process insmod (pid: 690, stack limit = 0x84da2190)
Stack: (0x84da39a0 to 0x84da4000)

So, it looks like a null pointer problem.

I did the following steps to help narrow down the problem.

30-30-30 reset
Device boots properly
Re-apply yesterday's NVRAM backup
Reboot and it crashes again.

30-30-30 reset
Took a backup of all the mtd partitions, so I could see if any are different to when it's working (dd if=/dev/mtd0ro of=mtd0 onto USB device, rinse and repeat through to mtd5ro)
Re-applied current firmware, just in case the firmware was bad (Brainslayer build 30432)
Rebooted OK
Re-loaded yesterday's NVRAM backup
Reboot and it crashes again
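
The "rinse and repeat" dd step above can be wrapped in a loop. This is only a sketch: the function name and the /mnt/usb mount point are assumptions, and the directories are parameters so that it can be tried on ordinary files before pointing it at /dev.

```shell
#!/bin/sh
# Copy every read-only MTD partition node found in dev_dir into out_dir,
# one output file per partition (mtd0ro -> mtd0, mtd1ro -> mtd1, ...).
backup_mtd() {
    dev_dir=$1
    out_dir=$2
    for node in "$dev_dir"/mtd[0-9]ro; do
        [ -e "$node" ] || continue          # glob matched nothing
        name=$(basename "$node" ro)          # strip the "ro" suffix
        dd if="$node" of="$out_dir/$name" 2>/dev/null || return 1
    done
}

# On the router this might be called as (mount point is an assumption):
#   backup_mtd /dev /mnt/usb/before
```

Taking one set into a "before" directory and another into an "after" directory keeps the two states apart for later comparison.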

30-30-30 reset
Re-apply previous NVRAM backup
Successful boot

The only known difference between the two is that I had exported out all the traff data and re-imported it (grep out the traff history data from nvram dump of old router, used sed to add "nvram set" prefix on all commands and added quote signs to the string being passed). All worked fine in the GUI. I'm 99% sure I rebooted after applying that yesterday.
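
For reference, the grep and sed step described above could look something like the sketch below. It assumes the dump has one name=value pair per line and that the history entries start with "traff-"; both are assumptions about the dump format, and the function name is made up. Values containing double quotes or dollar signs would need extra escaping.

```shell
#!/bin/sh
# Turn "name=value" lines read from stdin into "nvram set" commands.
# Only names starting with "traff-" are kept (assumed naming of the
# traffic history entries); the value is wrapped in double quotes.
traff_to_script() {
    grep '^traff-' | sed 's/^\([^=]*\)=\(.*\)$/nvram set \1="\2"/'
}

# Example use (file names are placeholders):
#   traff_to_script < old_router_dump.txt > import_traff.sh
```

The generated script only sets values; an "nvram commit" is still needed afterwards, and it is worth checking the size reported by "nvram show" before rebooting.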

After the successful reboot, I've copied off all the mtd partitions again to compare against.

As there isn't a binary diff I know of in Linux, I've compared the 5 before and after files using the following approach: "hexdump -C mtdx > mtdx.hex", then diff'ed the before and after .hex files.
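
For a plain identical/different answer, cmp (a standard POSIX utility) does a byte-for-byte comparison, so the hexdump step is only needed when you want to see where the files differ. A minimal sketch, assuming the two sets of dumps sit in separate directories with matching file names (the function name is illustrative, and whether cmp is present on a given build is not something I've checked):

```shell
#!/bin/sh
# Report, per partition dump, whether the copy in "before" matches the
# copy of the same name in "after". cmp -s prints nothing and only sets
# its exit status (0 = identical).
compare_dumps() {
    before=$1
    after=$2
    for f in "$before"/mtd*; do
        [ -e "$f" ] || continue
        name=$(basename "$f")
        if cmp -s "$f" "$after/$name"; then
            echo "$name identical"
        else
            echo "$name differ"
        fi
    done
}

# Example use (directory names are placeholders):
#   compare_dumps /mnt/usb/before /mnt/usb/after
```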

The results are :
Partition  Size    State      Purpose
mtd0       512 KB  Identical  CFE
mtd1       1.5 MB  Differs    NVRAM
mtd2       30 MB   Identical  Firmware (dd-wrt image)
mtd3       29 MB   Identical  Unknown purpose
mtd4       512 KB  Identical  Calibration data ??? Text data with lots of xxcalxx=val. Looks like I should keep a backup of this too !
mtd5       80 MB   Identical  JFFS (found from the output of mount)

Sizes are the raw dd size, not .hex files.

So the problem is related to NVRAM, which correlates with Dave's view and the advice given to Arin on this thread. The question in my mind, as this seems to be unique to this router: is there a bug in the nvram handling on this platform (Kong's 128K NVRAM update ?). I note that my current nvram usage is 57981 bytes, which probably gets nudged over 64K by the additional traff history. The only thing against this is that I'm sure others have simpler configs and less nvram used, so they should be well below this threshold. Is this another +1 for the suggestion of moving non-config data out of NVRAM ? TRAC #5509

Dave - what does an "nvram show" on your box give for the size used ?
Also check with "nvram backup /somewhere/nvram.bak", then ls -al /somewhere/nvram.bak and check the size there too.

I'll now look at the differences between the config and see what I can find.


Last edited by phoenix127 on Thu Aug 18, 2016 18:55; edited 1 time in total
phoenix127
DD-WRT User


Joined: 02 Jan 2011
Posts: 80
Location: UK

Posted: Thu Aug 18, 2016 18:31    Post subject: >64K NVRAM issue ..
I've now checked the nvram dumps and my working config dump is 60K, the failing one is 78K. I don't want to restore these directly as the router reboots, which will cause the issue again.

Having looked at the script which imports my old traff data it is 66 lines and about 18K in size, which correlates to the above.

I'm therefore 99% certain this is related to >64K NVRAM usage and the 128K NVRAM expansion has some form of pointer related issue when reading the config at boot up.

Could it be that others having the same issue are storing DHCP, MAC filters, IPv6 radvd, dhcp6s or other such bulky data in NVRAM, taking it over 64K and hence hitting the bug?

I'll run without the historic traff data for a while and see what happens.

Please can we take non-config data out of the NVRAM ???
DaveTheNerd
DD-WRT User


Joined: 15 Jul 2008
Posts: 317

Posted: Thu Aug 18, 2016 19:00    Post subject: Re: >64K NVRAM issue ..
@phoenix127, you rock. Again, thanks for all your work and thanks simply for having this dialog. Ok... on to the good stuff.

'nvram show' yields this for me:
size: 62148 bytes (68924 left)

I cleared out my ttraff data recently, so while it's recording still it's not holding on to many KB of history. I do also store DHCP leases and other stuff in there, building up the cruft.
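
As a cross-check on that output line, here is a small sketch that pulls the two numbers out of the "size:" line and adds them up. The function names are made up, and how the line is captured on the router (it may arrive on stdout or stderr) is left as an assumption; the sample line is the one quoted above.

```shell
#!/bin/sh
# Extract the used and free byte counts from a line of the form
#   size: 62148 bytes (68924 left)
nvram_used() { sed -n 's/^size: \([0-9]*\) bytes.*/\1/p'; }
nvram_left() { sed -n 's/^size: [0-9]* bytes (\([0-9]*\) left).*/\1/p'; }

line='size: 62148 bytes (68924 left)'
used=$(echo "$line" | nvram_used)
left=$(echo "$line" | nvram_left)
echo "used=$used left=$left total=$((used + left))"
```

The total comes to 131072 bytes, which is exactly 128 x 1024, so the reported figures are consistent with a 128KB NVRAM allocation, and 62148 used is still under 65536 (64 x 1024).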

Your thoughts about the 64KB barrier are VERY interesting to me, because these boot-crashes remind me of the pre-128KB builds that Kong was doing for the R8500 at the start (I think December or January?). I had reported the same issue back then and suggested it might be hitting my NVRAM limit, and Kong immediately put out an update for the R8500 that raised NVRAM from 64KB to 128KB... but maybe there's another limit or something imperfect about the way that's being handled.

Perhaps we're dealing with two things here: an NVRAM limit/buffer and a heat issue. The heat issue causes the seemingly random reboots... and the NVRAM issue keeps it from successfully booting. I certainly have experienced one without the other, so they're not directly related.
DaveTheNerd
DD-WRT User


Joined: 15 Jul 2008
Posts: 317

Posted: Thu Aug 18, 2016 19:22    Post subject:
I just wiped out my ttraff data and disabled the ttraff Daemon. My NVRAM then sat at "60 KB / 128 KB" according to the UI.

Then, after requisite backups and such, I did a restart (on Kong 30370M) and then a (dirty) update to KONG 30430, both without issue.

I think we're getting somewhere. I bet if my NVRAM was at 65KB (or higher, as I've seen it before), my restart would have failed.
DaveTheNerd
DD-WRT User


Joined: 15 Jul 2008
Posts: 317

Posted: Thu Aug 18, 2016 19:32    Post subject: Re: Tracking down the cause of the stuck-in-the-reboot loop
phoenix127 wrote:
Also check with "nvram backup /somewhere/nvram.bak", then ls -al /somewhere/nvram.bak and check the size there too.


Thankfully I'm totally retentive and have archives of my backups (and firmware) going back years. Right in mid-May, when I started experiencing all of this in earnest, was when my NVRAM backups tipped from being under 64KB to being over.

[I'm factoring in a few extra KB in that number, because my current 60KB NVRAM shows up as a 63KB backup, so I'm using that same fuzzy math to corroborate my dates above]
phoenix127
DD-WRT User


Joined: 02 Jan 2011
Posts: 80
Location: UK

Posted: Thu Aug 18, 2016 19:42    Post subject: NVRAM issue and thermal issue.
Dave,
I am in full agreement with you here, there are two unrelated issues.

On your previous post, your figure is also very close to 64K.

My guess is that the NVRAM issue is a coding error not handling >64K, so should be easy to fix. I've PM'ed Kong to ask if he could review this thread and advise.

You could prove the 64K theory if you want by doing the following, which will force the failure and prove it's not just me doing something daft:

1. Backup nvram configuration to PC
2. Add some extra settings to nvram to take it beyond 64K, e.g.
nvram set dave1="a really long string with lots of random stuff in it"
nvram set dave2="something equally useless to fill up memory"
3. nvram commit
4. Reboot - will go into loop
5. 30-30-30 reset to clear nvram.
6. Re-load original backed up config
7. Reboot - will work again.
8. Post results here.

On the thermal issue, I'm not convinced that it's actually causing my problem, but it is a cause for concern, as it's too close for comfort in my mind if the weather was hotter.

There is an option in the GUI to see the CPU temperature; it's on the status page under CPU. Mine reads 74.7C, while the outside of the case shows 48C.

There is some dialogue on that issue re Broadcom, its not this model but details are here https://www.dd-wrt.com/phpBB2/viewtopic.php?t=290265&sid=8ae26829ea36720abd6b29689aee0077

I'll probably monitor the temperature with an external sensor like one of these:

PC fan adapter http://www.ebay.co.uk/itm/331712533319
Ordinary fan adapter http://www.ebay.co.uk/itm/401150374540

Both can directly control the fan, so it's only on when needed. I'll probably mount a fan on top of the case pulling air out from the chassis to aid ventilation.

I might even play with underclocking the CPU, but that's a waste having paid out for a fast router.

I had hoped to mount mine where the e4200 lived, but with the aerials to the side. However, having seen the heat sink fin direction and the thermal problem, I probably won't do that now. Currently it's looking like I'll bolt it to the wall.
phoenix127
DD-WRT User


Joined: 02 Jan 2011
Posts: 80
Location: UK

Posted: Fri Aug 19, 2016 11:58    Post subject: Demonstrating the NVRAM bug
I decided to simplify the configuration to see if I can force the NVRAM failure, and I can. Here's how.

Set up PC on 192.168.1.10

start ping -t 192.168.1.1 in a command prompt.

We know that TTL=100 means CFE boot loader / corrupt firmware. The router does produce some icmp replies with TTL=100 when it goes through the CFE bootloader, so this is OK, but repeated sets of TTL=100, no reply, TTL=100 show that it's not booting. This is the kernel oops noted previously and the random reboot loops that several people have seen.
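
If you would rather not watch the ping window by eye, the TTL rule can be written down as a small classifier. This is a sketch only: the function name and labels are invented, and it assumes each line of ping output carries a "TTL=" or "ttl=" field, with 100 meaning the CFE boot loader and 64 meaning the running firmware.

```shell
#!/bin/sh
# Classify one line of ping output by its TTL field.
#   TTL=100 -> reply from the CFE boot loader
#   TTL=64  -> reply from the booted firmware
#   other   -> no usable reply (timeout, unreachable, ...)
classify_ping() {
    case $1 in
        *[Tt][Tt][Ll]=100*) echo cfe ;;
        *[Tt][Tt][Ll]=64*)  echo booted ;;
        *)                  echo none ;;
    esac
}

# Example: feed it saved ping output line by line (file name is a placeholder)
#   while IFS= read -r l; do classify_ping "$l"; done < ping.log
```

A healthy boot shows up as cfe, none, then booted; a boot loop shows up as cfe and none repeating without ever reaching booted.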

Start by 30-30-30 reset of the router
in the web interface, set password, enable ssh
ssh into the router

nvram show
shows 50096 bytes (48.9K) of NVRAM used in reset state

Run the following to put some data in NVRAM (it's all one line):

for d in `seq 1 44` ; do nvram set z_fill$d="ABCDEFGHIJLKMONPQRSTUVWXYZ0123456789ABCDEFGHIJLKMONPQRSTUVWXYZ0123456789ABCDEFGHIJLKMONPQRSTUVWXYZ0123456789ABCDEFGHIJLKMONPQRSTUVWXYZ0123456789ABCDEFGHIJLKMONPQRSTUVWXYZ0123456789ABCDEFGHIJLKMONPQRSTUVWXYZ0123456789ABCDEFGHIJLKMONPQRSTUVWXYZ0123456789ABCDEFGHIJLKMONPQRSTUVWXYZ0123456789ABCDEFGHIJLKMONPQRSTUVWXYZ0123456789ABCDEFGHIJLKMONPQRSTUVWXYZ0123456789"; done

This takes NVRAM to 66394 bytes used (64.8K).

nvram commit
nvram show (to verify the above size used)
reboot

Watch the TTL on the ping command, you get a set of TTL=100, no reply, then TTL=64. You can now log back into router, so this worked OK.

For completeness, you could 30-30-30 reset at this point to show that the test is identical, but it seems to make no difference if you are adding data to NVRAM.

Repeat the above test but change the seq to go up to 45.

NVRAM usage this time will be 66885 (65.3K), which is just over the 64K limit, but well under the allocated 128K.

nvram commit
reboot

Ping output now shows sets of ttl=100, no reply, ttl=100, no reply
The only way to get the router back is a 30-30-30.

This seems to demonstrate that the oops and the NVRAM usage are part of the same issue. I note that the 64K limit is actually at 65K (and above), not 64K.

I have repeated the above with the loop going up to 100 which occupies 87188 bytes (85.1K), this also causes the same issue, so it looks like anything over 65K usage causes the problem.
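
To make the byte/K arithmetic in this post easy to re-check, here is a small sketch (the helper name is made up) that converts the byte counts reported above into units of 1024 bytes:

```shell
#!/bin/sh
# Convert a byte count to units of 1024 bytes, one decimal place.
to_kib() {
    awk -v b="$1" 'BEGIN { printf "%.1f\n", b / 1024 }'
}

for bytes in 50096 66394 66885 87188; do
    echo "$bytes bytes = $(to_kib "$bytes")K"
done
# 64K is 65536 bytes and 65K is 66560 bytes, so the last size that booted
# (66394) is already above 64K, and the first that failed (66885) is the
# first one above 65K.
```

It prints 48.9K, 64.8K, 65.3K and 85.1K, matching the figures quoted, and it shows that the working 44-entry case already sits above 65536 bytes while staying below 66560.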

Another oddity I noticed is that when you nvram commit, then nvram show, then reboot, more NVRAM is in use after the reboot than before it. I assume this is some other data that DD-WRT is creating and storing. This may go some way to explaining why working configs suddenly turn into dead ones when nothing changed.
DaveTheNerd
DD-WRT User


Joined: 15 Jul 2008
Posts: 317

Posted: Fri Aug 19, 2016 13:05    Post subject:
Thanks, @phoenix127! I was testing this morning, as well, and have confirmed the same thing. I wanted to be certain it wasn't related to ttraff data and, as you also confirmed, it's not! Progress.

Kong appears to be aware of this from his post, so hopefully he's got a solution in mind: http://www.dd-wrt.com/phpBB2/viewtopic.php?p=1043514#1043514
phoenix127
DD-WRT User


Joined: 02 Jan 2011
Posts: 80
Location: UK

Posted: Wed Aug 24, 2016 22:57    Post subject: Good news
Following a couple of PMs, I have just tested Kong build 30465M Kongac and it now works perfectly for larger NVRAM. My current configuration including old TRAFF data takes the NVRAM to 76 KB and there are no boot loops.

Thank you very much Kong. I appreciate your hard work !

Note that Brainslayer build 30471 (which is higher) does not yet include this fix.

The other difference I noticed is that Kong's build is 21 MB and Brainslayer's is 29 MB. I'm not sure what the other differences are yet as everything seems to work OK.

Dave - would you also test please ?
DaveTheNerd
DD-WRT User


Joined: 15 Jul 2008
Posts: 317

Posted: Thu Aug 25, 2016 0:06    Post subject: Re: Good news
phoenix127 wrote:
Following a couple of PMs, I have just tested Kong build 30465M Kongac and it now works perfectly for larger NVRAM. My current configuration including old TRAFF data takes the NVRAM to 76 KB and there are no boot loops.

Thank you very much Kong. I appreciate your hard work !

Note that Brainslayer build 30471 (which is higher) does not yet include this fix.

The other difference I noticed is that Kong's build is 21 MB and Brainslayer's is 29 MB. I'm not sure what the other differences are yet as everything seems to work OK.

Dave - would you also test please ?


W00t! Nice work, @Kong and @phoenix127. I just tested by not only filling up NVRAM with junk data but also filling it with my old traff data. All works fine, well past 65K. Reboots are once again reliable. No boot loops to speak of.

Our long international nightmare is over. Thanks!
stalonge
DD-WRT Guru


Joined: 21 Jul 2006
Posts: 1898
Location: Fortaleza Ce Brazil

Posted: Thu Aug 25, 2016 8:46    Post subject:
Guys,

my r8500 is working fine with my full settings (old traff included).

In a few hours it will break the second day barrier.

_________________
DDwrt ...it rocks ....

1 R7800 54420 AP Wireguard webserver JFFS SAMBA FTP usb HD Mesh
1 R7800 54420 Cli Mesh
1 WZR1750 54389 AP Webserver Samba Wireguard
1 TP link Archer C7v5 54420 Cli Mesh
1 DD x86_64 48296 Gateway Samba Ftp Webserver
Foaley
DD-WRT Novice


Joined: 13 Dec 2016
Posts: 8

Posted: Tue Dec 13, 2016 8:16    Post subject:
I just bought an R8500, and I've tried just about everything I could find on the internet to flash this damn router with DD-WRT. Could someone please help me out, I can't stand the stock firmware, thank you!

Foaley
All times are GMT
