Friday, 2016-08-05

sjenningsm_w, so it's been a while, not sure if you remember our discussion on the realtek NIC dropping to 100Mbps a minute after boot.17:51
m_wI do17:52
m_wdid you get things figured out?17:52
sjenningsjust found out something interesting.  i was using the silverjaw lure at the time.  for a recent project i dropped the lure and used a microSD card and the problem went away.17:52
m_wso something on the lure is causing grief with the NIC17:53
sjenningsseems like it, not sure if it could be something with the memory mapping or something electrical (additional power draw)17:54
m_wdoes it happen with the lure attached and nothing populated in the slots?17:54
sjenningsgood question, let me try17:55
m_wat least we have another clue17:55
sjenningsso with nothing in the msata slot, the problem does not occur18:00
sjenningsm_w, ^18:00
m_wvery strange18:00
sjenningslure is attached but nothing in the slots18:00
m_wbefore you had an mSATA right?18:01
m_wcan you have an mSATA installed and still boot from SD?18:01
sjenningsm_w, let me try18:01
sjenningsgah, this doesn't make sense. now the mSATA drive is in the lure, booted from microSD, and it still works18:08
sjenningsm_w, ^18:08
m_wokay is the mSATA mounted?18:09
sjenningsm_w, i did that, and put it under some load doing a dd of a 300MB file and it stayed stable18:09
m_wthis is strange18:10
m_wkeep the mSATA load going for 10 minutes and see if it drops18:11
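A drop from gigabit to 100Mbps during a long soak is easy to miss by eye, so a small poller can record when the negotiated speed changes. The following is a minimal sketch, assuming a Linux system that exposes the speed at `/sys/class/net/<iface>/speed`; the function names and the interface name are placeholders, not anything used in this conversation.

```python
import time
from pathlib import Path


def read_link_speed(iface, sysfs_root="/sys/class/net"):
    """Return the negotiated speed in Mb/s, or None if it cannot be read."""
    try:
        return int((Path(sysfs_root) / iface / "speed").read_text().strip())
    except (OSError, ValueError):
        # Reading can fail (or yield a non-number) while the link is down.
        return None


def watch_link_speed(iface, duration_s=600, interval_s=1.0,
                     sysfs_root="/sys/class/net"):
    """Poll the link speed; return a list of (elapsed_s, speed) changes."""
    changes = []
    last = object()  # sentinel so the first reading is always recorded
    start = time.monotonic()
    while True:
        elapsed = time.monotonic() - start
        speed = read_link_speed(iface, sysfs_root)
        if speed != last:
            changes.append((round(elapsed, 1), speed))
            last = speed
        if elapsed >= duration_s:
            break
        time.sleep(interval_s)
    return changes
```

Calling `watch_link_speed("eth0")` (with the real interface name substituted) for the ten-minute window would return a single entry if the link stayed stable, and one extra entry for each renegotiation otherwise.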
m_win the meantime let's think about what else may have been different18:11
m_ware you using the same kernel/distro on the uSD as the mSATA?18:12
sjenningsm_w, yes same kernel.  just booted from the mSATA(which the microSD card still inserted) and the problem came back.  it seems the booting from the mSATA drive is required to reproduce.18:14
sjennings*with the18:16
m_wsjennings: what is changed in order to boot from uSD?18:17
m_wsjennings: just jumper settings?18:17
sjenningsi just get into the UEFI (F2) and select the boot device18:18
m_wso boot back into the uSD and let's try reproducing it18:20
sjenningsm_w, ok done18:21
m_wcreate a loop that accesses the mSATA continually for at least 10 minutes18:22
m_wperhaps creating a file with dd, moving to another file, and repeating18:23
m_wmaybe put a sync after the move18:24
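The write, move, sync loop suggested above could be sketched roughly as follows. This is an illustration only: the function name, file names, and default sizes are assumptions, and the directory is expected to be a mount point on the mSATA drive. Note that it writes to the drive, so it adds flash wear.

```python
import os
import time


def churn_files(directory, duration_s=600, size_bytes=64 * 1024 * 1024,
                chunk_bytes=1024 * 1024):
    """Write a file, rename it, and sync, repeating until duration_s elapses.

    Returns the number of completed write/rename cycles.
    """
    src = os.path.join(directory, "churn.tmp")   # hypothetical file names
    dst = os.path.join(directory, "churn.dat")
    chunk = b"\0" * chunk_bytes
    cycles = 0
    deadline = time.monotonic() + duration_s
    while True:
        with open(src, "wb") as f:
            remaining = size_bytes
            while remaining > 0:
                n = min(remaining, chunk_bytes)
                f.write(chunk[:n])
                remaining -= n
            f.flush()
            os.fsync(f.fileno())      # push the file data to the device
        os.replace(src, dst)          # the "move to another file" step
        dir_fd = os.open(directory, os.O_RDONLY)
        try:
            os.fsync(dir_fd)          # the "sync after the move" step
        finally:
            os.close(dir_fd)
        cycles += 1
        if time.monotonic() >= deadline:
            break
    os.remove(dst)                    # leave the directory as it was found
    return cycles
```

At least one cycle always runs, so a very short `duration_s` is enough for a quick smoke test before starting the full ten-minute run.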
sjenningsm_w, so i'm dd'ing read (to avoid wear for now) on the mSATA and iperf'ing the NIC at gigabit.  cpu is 0% idle.  everything is stable so far.18:25
m_wit would be interesting to monitor the interrupts on the uSD and mSATA boots to compare, we can do that next if this is fruitless18:26
sjenningsm_w, nothing so far, pushing 250MB/s over mSATA and gigabit over NIC and it is stable18:29
m_wthe was almost always after 10 minutes before right?18:31
m_wit was18:32
sjennings1 minute18:32
m_wsjennings: still no failure?18:49
sjenningsm_w, nope completely solid when booted from uSD18:49
m_wlet's take a look at the interrupts with uSD boot18:50
m_wcat /proc/interrupts18:51
m_wlog that18:51
m_wthen boot into mSATA and monitor the interrupts until the failure occurs18:51
sjenningsso you want interrupt counts after the same amount of uptime for both situations?18:52
m_wwatch -n1 "cat /proc/interrupts"18:54
m_wthat will allow you to monitor the interrupts as they occur as well18:54
sjenningsm_w, what am i looking for?18:57
m_winterrupt flood18:58
m_ware any of the relevant interrupts shared?18:59
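Spotting a flood or a shared line by watching the counters scroll is error-prone, so two saved copies of `/proc/interrupts` can be compared instead. The sketch below is a rough illustration under a few assumptions: the function names are made up, the flood threshold is arbitrary, and a shared IRQ is assumed to show up as several device names separated by ", " at the end of a line.

```python
def parse_interrupts(text):
    """Map each IRQ label to (total_count, description)."""
    lines = text.strip("\n").splitlines()
    ncpus = len(lines[0].split())            # header row: CPU0 CPU1 ...
    table = {}
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        counts = []
        for tok in fields[:ncpus]:           # leading per-CPU counters
            if not tok.isdigit():
                break
            counts.append(int(tok))
        desc = " ".join(fields[len(counts):])
        table[label.strip()] = (sum(counts), desc)
    return table


def interrupt_rates(before, after, interval_s):
    """Interrupts per second for every IRQ present in both snapshots."""
    return {
        label: (after[label][0] - before[label][0]) / interval_s
        for label in after
        if label in before
    }


def find_floods(rates, threshold_per_s=10000.0):
    """Labels whose rate exceeds the (arbitrary) threshold."""
    return sorted(label for label, r in rates.items() if r > threshold_per_s)


def shared_irqs(table):
    """Labels whose description lists more than one device (assumed ', ')."""
    return sorted(label for label, (_, desc) in table.items() if ", " in desc)
```

Taking one snapshot, waiting a known interval, and taking another gives per-IRQ rates that can be compared directly between the uSD boot and the mSATA boot.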
sjenningsafter about 5 minutes of uptime19:01
sjenningsahci is about 10 interrupt/s19:01
sjenningseverything seems normal19:01
m_wthat is on mSATA or uSD?19:01
sjenningsfrom uSD, the ahci interrupt count was low (~200) and mmc2 is about 10 int/s19:02
sjenningsuSD about 80 seconds of uptime
m_wcan you provide the kernel messages of booting on each device?19:08
m_wI am not seeing anything that stands out19:09
sjenningsm_w, you and me both.  i'm trying something. i'm reloading the r8169 driver to see if it stays gigabit after reload when booting from mSATA.19:10
m_wI see a serial interrupt that is not there for the uSD19:12
m_w4:        847          0   IO-APIC    4-edge      serial19:12
sjenningsso i just did that same stress (dd + iperf) after reloading the driver booted from the sata, and it is stable so far19:15
m_wI thought you tried that before?19:15
sjenningsme too19:16
sjenningsthat's why i'm waiting19:16
sjenningsi have changed my switch since the last time as well just to remove that from the equation19:18
sjenningscould have been a compounding issue before (bad autoneg on the switch)19:18
sjenningsok, so it is stable this way for now.  i'm going to get some dmesgs.19:19
m_wany idea why the serial interrupt would happen on one but not the other?19:19
sjenningsm_w, no idea.  i don't have anything on the serial port atm.19:20
m_wmaybe different bootargs19:20
sjenningsoh, i do have a getty running in one case but not the other19:21
m_wthat'll do it19:23
m_wI would shut it off to eliminate differences just in case19:23
m_wso the mSATA fails once but not after the reload of the driver, and uSD never fails19:25
sjenningsm_w, well great.. now it is stable for both19:32
m_wwhat changed?19:32
sjenningsi had the idea of booting in rescue mode ( for systemd, basically single user mode) just to remove most of the userspace interference, and it doesn't happen.  so there must be some difference in the userspaces.19:34
m_wso it is a software issue19:39
sjenningsm_w something in the userspace of the sata install must be triggering (either on purpose or hitting a bug) a NIC reset19:39
sjenningshaha yes, doesn't appear to be hardware then.  just user error, again *sigh*19:40
m_wit must be in the boot process somewhere19:41
sjenningsyes, or one of the services started in multi-user mode19:41
sjenningswhat a mess. thanks for your time (that i've wasted!)19:41
m_wnah this is fun19:42
m_wI like debugging stuff19:42
m_wotherwise I would have kept my mouth shut in the first place19:42
sjenningsi guess you are in the right line of work then :)19:42
m_wI would like to see the root cause19:42
m_wI blame systemd :D19:43
sjenningsm_w i'll let you know when i find it.  now that i have a good state and a bad state, should be able to add things one at a time until it breaks.19:43
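With a known-good state (rescue mode) and a known-bad state (full multi-user boot), the one-at-a-time search described above could be sketched like this. Everything here is hypothetical: `link_stays_gigabit` stands for whatever manual or scripted check boots with the given units enabled and reports whether the NIC held gigabit. The second function is a binary-search variant, which is a swap from the one-at-a-time approach and only makes sense if a single unit is responsible.

```python
def first_breaking_unit(units, link_stays_gigabit):
    """Enable units cumulatively; return the first one whose addition fails."""
    enabled = []
    for unit in units:
        enabled.append(unit)
        if not link_stays_gigabit(list(enabled)):
            return unit
    return None  # nothing in the list reproduced the problem


def bisect_breaking_unit(units, link_stays_gigabit):
    """Binary search over prefixes of units, assuming a single culprit."""
    units = list(units)
    if link_stays_gigabit(units):
        return None
    lo, hi = 0, len(units)  # invariant: units[:lo] is good, units[:hi] is bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if link_stays_gigabit(units[:mid]):
            lo = mid
        else:
            hi = mid
    return units[hi - 1]
```

The linear version needs up to one test boot per unit, while the bisecting version needs roughly log2 of that, at the cost of assuming the failure depends on exactly one unit being present.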
m_wsome kind of throttling perhaps?20:01
