linux 4.10 on ast2400
Patrick Venture
venture at google.com
Tue Dec 19 12:09:13 AEDT 2017
I'll check it out -- I don't imagine the kernel is deliberately
reserving varying amounts of memory -- I suspect some driver is
corrupting memory. With the 4.7 kernel running, there's a massive
memory leak, but the system always starts with approximately the same
amount free.
Patrick
On Mon, Dec 18, 2017 at 3:11 PM, David Duffey (dduffey)
<dduffey at cisco.com> wrote:
>
> /proc/zoneinfo may provide some useful hints (if it exists)
>
> Not exactly applicable, but on x86 hosts I've seen hardware (via e820) reserve different amounts and addresses from boot to boot. Additionally, there is some logic in the kernel to reserve those ranges at certain boundaries, so some addresses would cause more reserved memory than others.
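Following up on the /proc/zoneinfo hint above: one quick way to quantify how much the kernel has reserved is to compare the "present" and "managed" page counts per zone. A minimal sketch, parsing a made-up sample rather than a live /proc/zoneinfo (the numbers below are illustrative, not from the quanta-q71l):

```shell
# Sketch: reserved pages per zone = present - managed.
# Field names are as they appear in /proc/zoneinfo; on the BMC you
# would pipe the real file in instead of this sample.
zoneinfo_sample='Node 0, zone   Normal
  pages free     10557
        present  28672
        managed  26582'

reserved=$(echo "$zoneinfo_sample" | awk '
  /present/ { p = $2 }
  /managed/ { m = $2 }
  END { print p - m }')
echo "reserved pages: $reserved"
```

Capturing this immediately after each boot would show whether the reservation itself varies, or whether something is eating memory after boot.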
>
> -----Original Message-----
> From: openbmc [mailto:openbmc-bounces+dduffey=cisco.com at lists.ozlabs.org] On Behalf Of Patrick Venture
> Sent: Monday, December 18, 2017 3:58 PM
> To: Joel Stanley <joel at jms.id.au>
> Cc: OpenBMC Maillist <openbmc at lists.ozlabs.org>
> Subject: Re: linux 4.10 on ast2400
>
> I loaded 4.10 with some debug memory options enabled, but I noticed that each reboot could have wildly different free memory. So here are the results from dumping the file immediately after each boot:
>
> root at quanta-q71l:~# cat /proc/meminfo
> MemTotal: 115076 kB
> MemFree: 42228 kB
>
> root at quanta-q71l:~# cat /proc/meminfo
> MemTotal: 115076 kB
> MemFree: 1668 kB
>
> root at quanta-q71l:~# cat /proc/meminfo
> MemTotal: 115076 kB
> MemFree: 1876 kB
>
> root at quanta-q71l:~# cat /proc/meminfo
> MemTotal: 115076 kB
> MemFree: 27464 kB
>
> root at quanta-q71l:~# cat /proc/meminfo
> MemTotal: 115076 kB
> MemFree: 12140 kB
>
> root at quanta-q71l:~# cat /proc/meminfo
> MemTotal: 115076 kB
> MemFree: 2084 kB
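To collect the numbers above without manual dumps, a boot-time script could extract MemFree and append it to a log on persistent storage. A small sketch of the parsing step, run against a sample here rather than the live /proc/meminfo:

```shell
# Sketch: pull the MemFree value (in kB) out of /proc/meminfo so it can
# be logged at each boot for boot-to-boot comparison.  The sample below
# stands in for the real file.
meminfo_sample='MemTotal:       115076 kB
MemFree:         42228 kB'

memfree_kb=$(echo "$meminfo_sample" | awk '/^MemFree:/ { print $2 }')
echo "MemFree at boot: ${memfree_kb} kB"
```

Appending the result with a timestamp (e.g. to a file in the read-write overlay) would make the variation easy to graph across reboots.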
>
> On Thu, Nov 9, 2017 at 11:47 AM, Patrick Venture <venture at google.com> wrote:
>> I added these configurations and after ~10 reboots it wasn't
>> reproducing, but I'll keep an eye out and update over the coming days.
>>
>> Thanks!
>>
>> On Tue, Nov 7, 2017 at 1:56 AM, Joel Stanley <joel at jms.id.au> wrote:
>>> On Tue, Nov 7, 2017 at 8:09 PM, Joel Stanley <joel at jms.id.au> wrote:
>>>> On Tue, Nov 7, 2017 at 11:42 AM, Patrick Venture <venture at google.com> wrote:
>>>>> I've been doing testing with linux 4.10 on the ast2400, and on some
>>>>> percentage of machines (roughly 20%) they're not able to launch
>>>>> applications after boot. The one we see failing is agetty, but
>>>>> ipmid also ends up not running. Here is the log of what we're
>>>>> seeing on the quanta-q71l:
>>>>>
>>>>> [ OK ] Started Clear one time boot overrides.
>>>>> [ OK ] Found device /dev/ttyS4.
>>>>> [ OK ] Found device /dev/ttyVUART0.
>>>>> [ 42.360000] 8021q: adding VLAN 0 to HW filter on device eth1
>>>>> [ OK ] Started Network Service.
>>>>> [ 42.420000] 8021q: adding VLAN 0 to HW filter on device eth0
>>>>> [ OK ] Started Phosphor Inventory Manager.
>>>>> [ OK ] Started Phosphor Settings Daemon.
>>>>> [ OK ] Reached target Network.
>>>>> Starting Permit User Sessions...
>>>>> [ OK ] Started Lightweight SLP Server.
>>>>> [ OK ] Started Phosphor Console Muxer listening on device /dev/ttyVUART0.
>>>>> [ OK ] Started Phosphor Inband IPMI.
>>>>> [ OK ] Created slice system-xyz.openbmc_project.Hwmon.slice.
>>>>> [ OK ] Started Permit User Sessions.
>>>>> [ OK ] Started Serial Getty on ttyS4.
>>>>> [ OK ] Reached target Login Prompts.
>>>>> [ OK ] Reached target Multi-User System.
>>>>> [ 44.530000] ftgmac100 1e680000.ethernet eth1: NCSI interface down
>>>>> [ 45.800000] ftgmac100 1e660000.ethernet eth0: NCSI interface down
>>>>> [ 49.430000] Unable to handle kernel paging request at virtual
>>>>> address e1a00006
>>>>> [ 49.430000] pgd = 85354000
>>>>> [ 49.430000] [e1a00006] *pgd=00000000
>>>>> [ 49.430000] Internal error: Oops: 1 [#1] ARM
>>>>> [ 49.430000] CPU: 0 PID: 932 Comm: (agetty) Not tainted
>>>>> 4.10.17-eced538e6233c50729cc107958596a1443947ba2 #1
>>>>
>>>> This SHA isn't in the OpenBMC dev-4.10 tree. Where are you getting
>>>> your kernel sources from?
>>>>
>>>> Wherever you've grabbed it from, it's out of date, as the line
>>>> numbers don't quite make sense.
>>>>
>>>>> [ 49.430000] Hardware name: ASpeed SoC
>>>>> [ 49.430000] task: 86e1c000 task.stack: 858f6000
>>>>> [ 49.430000] PC is at unlink_anon_vmas+0x98/0x1b0
>>>>
>>>> We have seen memory corruption when running under Qemu. This is the
>>>> first time I've had a report of it happening on hardware.
>>>>
>>>> https://github.com/openbmc/qemu/issues/9
>>>>
>>>> Can you share some information with how you're booting?
>>>>
>>>> Are you netbooting?
>>>>
>>>> Which u-boot tree are you using? Does it enable networking before
>>>> jumping to the kernel? Or trigger any other kinds of DMA?
>>>
>>> Can you reproduce with some debugging turned on? Build your kernel with:
>>>
>>> DEBUG_LIST
>>> PAGE_POISONING
>>> DEBUG_PAGEALLOC
>>> DEBUG_SLAB
>>>
>>> Or even more. Take a look through the Kernel hacking menu in
>>> menuconfig and enable things until the system slows down too much to
>>> reproduce the issue :)
>>>
>>> Does it reproduce if you disable the FTGMAC100 devices (set them to
>>> status = "disabled" in your device tree, or disable them in the
>>> kernel config)?
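A sketch of the device-tree change suggested above, assuming the mac0/mac1 node labels used by the aspeed-g4.dtsi include (check your board's .dts for the actual labels):

```
/* Sketch: disable both FTGMAC100 MACs in the board .dts to rule the
   NIC driver out.  Labels assume aspeed-g4.dtsi naming conventions. */
&mac0 {
	status = "disabled";
};

&mac1 {
	status = "disabled";
};
```

If the corruption stops with the MACs disabled, that would point at the FTGMAC100/NCSI path (or DMA left running from u-boot's network use) rather than a generic mm bug.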