linux 4.10 on ast2400

David Duffey (dduffey) dduffey at cisco.com
Tue Dec 19 10:11:03 AEDT 2017


/proc/zoneinfo may provide some useful hints (if it exists)
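
For example, dumping the per-zone page accounting right after a couple of boots and diffing the output (a sketch; field names vary a little by kernel version, but 4.10 should have these):

  grep -E 'zone|spanned|present|managed' /proc/zoneinfo

If "present" vs. "managed" pages differ between boots, that points at early reservations moving around rather than something leaking later.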

Not exactly applicable here, but on x86 hosts I've seen hardware (via e820) reserve different amounts and addresses from boot to boot.  Additionally, there is some logic in the kernel to align those ranges to certain boundaries, so some addresses would cause more reserved memory than others.
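
On such a host, the map the kernel saw can be dumped (and diffed between boots) with:

  dmesg | grep -i e820

(x86-only, so just an analogue for the BMC, but the same idea applies to whatever early reservations the ARM boot path makes.)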

-----Original Message-----
From: openbmc [mailto:openbmc-bounces+dduffey=cisco.com at lists.ozlabs.org] On Behalf Of Patrick Venture
Sent: Monday, December 18, 2017 3:58 PM
To: Joel Stanley <joel at jms.id.au>
Cc: OpenBMC Maillist <openbmc at lists.ozlabs.org>
Subject: Re: linux 4.10 on ast2400

I loaded 4.10 with some memory debugging options enabled, but I noticed that each reboot could show wildly different free memory.  Here are the results from dumping /proc/meminfo immediately after boot and then rebooting.

root at quanta-q71l:~# cat /proc/meminfo
MemTotal:         115076 kB
MemFree:           42228 kB

root at quanta-q71l:~# cat /proc/meminfo
MemTotal:         115076 kB
MemFree:            1668 kB

root at quanta-q71l:~# cat /proc/meminfo
MemTotal:         115076 kB
MemFree:            1876 kB

root at quanta-q71l:~# cat /proc/meminfo
MemTotal:         115076 kB
MemFree:           27464 kB

root at quanta-q71l:~# cat /proc/meminfo
MemTotal:         115076 kB
MemFree:           12140 kB

root at quanta-q71l:~# cat /proc/meminfo
MemTotal:         115076 kB
MemFree:            2084 kB

On Thu, Nov 9, 2017 at 11:47 AM, Patrick Venture <venture at google.com> wrote:
> I added these configurations and after ~10 reboots it wasn't 
> reproducing, but I'll keep an eye out and update over the coming days.
>
> Thanks!
>
> On Tue, Nov 7, 2017 at 1:56 AM, Joel Stanley <joel at jms.id.au> wrote:
>> On Tue, Nov 7, 2017 at 8:09 PM, Joel Stanley <joel at jms.id.au> wrote:
>>> On Tue, Nov 7, 2017 at 11:42 AM, Patrick Venture <venture at google.com> wrote:
>>>> I've been testing Linux 4.10 on the ast2400, and on some
>>>> percentage of systems (around 20%), after boot they're not able
>>>> to launch applications.  The one we see failing is agetty, but
>>>> ipmid also ends up not running.  Here is the log from what we're
>>>> seeing on the
>>>> quanta-q71l:
>>>>
>>>> [  OK  ] Started Clear one time boot overrides.
>>>> [  OK  ] Found device /dev/ttyS4.
>>>> [  OK  ] Found device /dev/ttyVUART0.
>>>> [   42.360000] 8021q: adding VLAN 0 to HW filter on device eth1
>>>> [  OK  ] Started Network Service.
>>>> [   42.420000] 8021q: adding VLAN 0 to HW filter on device eth0
>>>> [  OK  ] Started Phosphor Inventory Manager.
>>>> [  OK  ] Started Phosphor Settings Daemon.
>>>> [  OK  ] Reached target Network.
>>>>          Starting Permit User Sessions...
>>>> [  OK  ] Started Lightweight SLP Server.
>>>> [  OK  ] Started Phosphor Console Muxer listening on device /dev/ttyVUART0.
>>>> [  OK  ] Started Phosphor Inband IPMI.
>>>> [  OK  ] Created slice system-xyz.openbmc_project.Hwmon.slice.
>>>> [  OK  ] Started Permit User Sessions.
>>>> [  OK  ] Started Serial Getty on ttyS4.
>>>> [  OK  ] Reached target Login Prompts.
>>>> [  OK  ] Reached target Multi-User System.
>>>> [   44.530000] ftgmac100 1e680000.ethernet eth1: NCSI interface down
>>>> [   45.800000] ftgmac100 1e660000.ethernet eth0: NCSI interface down
>>>> [   49.430000] Unable to handle kernel paging request at virtual
>>>> address e1a00006
>>>> [   49.430000] pgd = 85354000
>>>> [   49.430000] [e1a00006] *pgd=00000000
>>>> [   49.430000] Internal error: Oops: 1 [#1] ARM
>>>> [   49.430000] CPU: 0 PID: 932 Comm: (agetty) Not tainted
>>>> 4.10.17-eced538e6233c50729cc107958596a1443947ba2 #1
>>>
>>> This SHA isn't in the OpenBMC dev-4.10 tree. Where are you getting 
>>> your kernel sources from?
>>>
>>> Wherever you've grabbed it from, it's out of date, as the line
>>> numbers don't quite make sense.
>>>
>>>> [   49.430000] Hardware name: ASpeed SoC
>>>> [   49.430000] task: 86e1c000 task.stack: 858f6000
>>>> [   49.430000] PC is at unlink_anon_vmas+0x98/0x1b0
>>>
>>> We have seen memory corruption when running under Qemu. This is the 
>>> first time I've had a report of it happening on hardware.
>>>
>>>  https://github.com/openbmc/qemu/issues/9
>>>
>>> Can you share some information about how you're booting?
>>>
>>> Are you netbooting?
>>>
>>> Which u-boot tree are you using? Does it enable networking before
>>> jumping to the kernel? Or trigger any other kind of DMA?
>>
>> Can you reproduce with some debugging turned on? Build your kernel with:
>>
>>  DEBUG_LIST
>>  PAGE_POISONING
>>  DEBUG_PAGEALLOC
>>  DEBUG_SLAB
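>>
>> In .config terms that's roughly (a sketch; note that on 4.10 page
>> poisoning additionally needs page_poison=on on the kernel command
>> line to actually take effect):
>>
>>  CONFIG_DEBUG_LIST=y
>>  CONFIG_PAGE_POISONING=y
>>  CONFIG_DEBUG_PAGEALLOC=y
>>  CONFIG_DEBUG_SLAB=y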
>>
>> Or even more. Take a look through the Kernel hacking menu in 
>> menuconfig and enable things until the system slows down too much to 
>> reproduce the issue :)
>>
>> Does it reproduce if you disable the FTGMAC100 devices (set them to 
>> status = "disabled" in your device tree, or disable them in the 
>> kernel config)?
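>>
>> For the device tree route, a sketch (the mac0/mac1 labels are
>> assumed from aspeed-g4.dtsi; check the labels in your platform's
>> dts):
>>
>>  &mac0 {
>>          status = "disabled";
>>  };
>>
>>  &mac1 {
>>          status = "disabled";
>>  };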

