linux 4.10 on ast2400

Wed Nov 8 02:29:19 AEDT 2017

On Tue, Nov 7, 2017 at 1:56 AM, Joel Stanley <joel at jms.id.au> wrote:
> On Tue, Nov 7, 2017 at 8:09 PM, Joel Stanley <joel at jms.id.au> wrote:
>> On Tue, Nov 7, 2017 at 11:42 AM, Patrick Venture <venture at google.com> wrote:
>>> I've been testing linux 4.10 on the ast2400, and on about 20% of
>>> systems, after boot, they aren't able to launch applications.  The
>>> one we see failing is agetty, but ipmid also ends up not running.
>>> Here is the log from what we're seeing on the quanta-q71l:
>>>
>>> [  OK  ] Started Clear one time boot overrides.
>>> [  OK  ] Found device /dev/ttyS4.
>>> [  OK  ] Found device /dev/ttyVUART0.
>>> [   42.360000] 8021q: adding VLAN 0 to HW filter on device eth1
>>> [  OK  ] Started Network Service.
>>> [   42.420000] 8021q: adding VLAN 0 to HW filter on device eth0
>>> [  OK  ] Started Phosphor Inventory Manager.
>>> [  OK  ] Started Phosphor Settings Daemon.
>>> [  OK  ] Reached target Network.
>>>          Starting Permit User Sessions...
>>> [  OK  ] Started Lightweight SLP Server.
>>> [  OK  ] Started Phosphor Console Muxer listening on device /dev/ttyVUART0.
>>> [  OK  ] Started Phosphor Inband IPMI.
>>> [  OK  ] Created slice system-xyz.openbmc_project.Hwmon.slice.
>>> [  OK  ] Started Permit User Sessions.
>>> [  OK  ] Started Serial Getty on ttyS4.
>>> [  OK  ] Reached target Login Prompts.
>>> [  OK  ] Reached target Multi-User System.
>>> [   44.530000] ftgmac100 1e680000.ethernet eth1: NCSI interface down
>>> [   45.800000] ftgmac100 1e660000.ethernet eth0: NCSI interface down
>>> [   49.430000] Unable to handle kernel paging request at virtual address e1a00006
>>> [   49.430000] pgd = 85354000
>>> [   49.430000] [e1a00006] *pgd=00000000
>>> [   49.430000] Internal error: Oops: 1 [#1] ARM
>>> [   49.430000] CPU: 0 PID: 932 Comm: (agetty) Not tainted 4.10.17-eced538e6233c50729cc107958596a1443947ba2 #1
>>
>> This SHA isn't in the OpenBMC dev-4.10 tree. Where are you getting
>> your kernel sources from?

We're on a branch based on the dev-4.10 tree.  It just adds a few
extra drivers, etc. -- for a different platform, actually.  Those
aren't compiled in here, so this should be nearly identical to
dev-4.10 for the quanta-q71l (ast2400 defconfig).

>>
>> Wherever you've grabbed it from it's out of date as the line numbers
>> don't quite make sense.
>>
>>> [   49.430000] Hardware name: ASpeed SoC
>>> [   49.430000] task: 86e1c000 task.stack: 858f6000
>>> [   49.430000] PC is at unlink_anon_vmas+0x98/0x1b0
>>
>> We have seen memory corruption when running under Qemu. This is the
>> first time I've had a report of it happening on hardware.
>>
>>  https://github.com/openbmc/qemu/issues/9

Looks like you were seeing this with the 4.7 kernel in qemu as well.

We're seeing it on about 20 machines, and not on every boot.

>>
>> Can you share some information with how you're booting?

Booting from flash chip.

>>
>> Are you netbooting?
>>
>> Which u-boot tree are you using? Does it enable networking before
>> jumping to the kernel? Or trigger any other kinds of DMA?

I'll have to check.  We do have minor customizations in u-boot, so
I'll look at whether it does anything with DMA.

>
> Can you reproduce with some debugging turned on? Build your kernel with:
>
>  DEBUG_LIST
>  PAGE_POISONING
>  DEBUG_PAGEALLOC
>  DEBUG_SLAB

I'll give that a try.
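
For reference, a sketch of how those options would look as a .config
fragment.  This assumes the usual CONFIG_ prefix; the DEBUG_KERNEL and
SLAB lines are my assumption about the dependencies (DEBUG_SLAB only
applies when SLAB is the selected allocator), so verify them in
menuconfig:

```
# Dependencies (assumed; verify in menuconfig)
CONFIG_DEBUG_KERNEL=y
CONFIG_SLAB=y
# Options suggested above
CONFIG_DEBUG_LIST=y
CONFIG_PAGE_POISONING=y
CONFIG_DEBUG_PAGEALLOC=y
CONFIG_DEBUG_SLAB=y
```

I believe page poisoning and debug_pagealloc are also off by default at
runtime on this kernel version and need page_poison=on and
debug_pagealloc=on on the kernel command line, but I haven't confirmed
that for 4.10.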

>
> Or even more. Take a look through the Kernel hacking menu in
> menuconfig and enable things until the system slows down too much to
> reproduce the issue :)
>
> Does it reproduce if you disable the FTGMAC100 devices (set them to
> status = "disabled" in your device tree, or disable them in the kernel
> config)?

Are you suggesting this because of the ncsi crash?  Because that's
always happened for us on these systems, even with the 4.7 kernel --
which has been very stable.
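
For reference, the device tree change would be something along these
lines.  The mac0 and mac1 labels are my assumption about what the
ast2400 dtsi calls the two FTGMAC100 nodes (1e660000 and 1e680000 in
the log above); adjust to whatever the board's tree actually uses:

```
/* Hypothetical labels; match them to the board's dts */
&mac0 {
	status = "disabled";
};

&mac1 {
	status = "disabled";
};
```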

Patrick