Freescale network device not activated on mpc8360 (kmeter1 board)

Thu Jan 9 06:25:54 AEDT 2020

On 08.01.2020 12:53, Matteo Ghidoni wrote:
> Hi Heiner, thank you for the quick answer.
> 
>>>  Hi all,
>>>
>>> With the introduction of the following patch, we are facing an issue with
>>> the activation of the Freescale network device (ucc_geth driver) on our
>>> kmeter1 board based on an MPC8360:
>>
>> +Li as ucc_geth maintainer
>>
>> Can you describe the symptoms of the issue?
> 
> I am trying to boot over NFS, but once the boot process has finished there is no network connection between the board and the host.
> 
>>>
>>> commit 124eee3f6955f7aa19b9e6ff5c9b6d37cb3d1e2c
>>> Author: Heiner Kallweit <hkallweit1 at gmail.com>
>>> Date:   Tue Sep 18 21:55:36 2018 +0200
>>>
>>>     net: linkwatch: add check for netdevice being present to linkwatch_do_dev
>>>
>>> Based on my observations, just before the device is activated through
>>> linkwatch_event, the controller wants to adjust the MAC configuration and, in
>>> order to do so, it detaches the device. This prevents the activation of the
>>> net device.
>>>
>> It sounds a little bit odd to rely on an asynchronous linkwatch event here.
>> Can you give a call trace?
> 
> Here is a call trace from the adjust_link function, taken inside the if condition at line 1644 (ucc_geth.c):
> 
> CPU: 0 PID: 35 Comm: kworker/0:1 Not tainted 5.4.8-dirty #19
> Workqueue: events_power_efficient phy_state_machine
> Call Trace:
> [cf88bde8] [c02ddca8] adjust_link+0x304/0x320 (unreliable)
> [cf88be28] [c02cbf3c] phy_check_link_status+0xe4/0xfc
> [cf88be48] [c02cccdc] phy_state_machine+0x44/0x170
> [cf88be78] [c00361a0] process_one_work+0x264/0x408
> [cf88bea8] [c00370f8] worker_thread+0x140/0x53c
> [cf88bef8] [c003d818] kthread+0xdc/0x108
> [cf88bf38] [c0010274] ret_from_kernel_thread+0x14/0x1c
> 
> Here is the call trace from the netif_carrier_on function, just before the call to linkwatch_fire_event (line 498, sch_generic.c):
> 
> CPU: 0 PID: 35 Comm: kworker/0:1 Not tainted 5.4.8-dirty #20
> Workqueue: events_power_efficient phy_state_machine
> Call Trace:
> [cf88bde8] [c0352064] netif_carrier_on+0xc4/0xc8 (unreliable)
> [cf88be08] [c02cf4ec] phy_link_change+0x84/0xb4
> [cf88be28] [c02cbf3c] phy_check_link_status+0xe4/0xfc
> [cf88be48] [c02cccdc] phy_state_machine+0x44/0x170
> [cf88be78] [c00361a0] process_one_work+0x264/0x408
> [cf88bea8] [c00370f8] worker_thread+0x140/0x53c
> [cf88bef8] [c003d818] kthread+0xdc/0x108
> [cf88bf38] [c0010274] ret_from_kernel_thread+0x14/0x1c
> 
> Moreover, I noticed that when I add the dump directly in the linkwatch_do_dev function (link_watch.c), the interface comes up correctly because of the delay introduced by dump_stack.
> 
> Here is another log with some prints that may help to understand the situation. Each print is placed just before the call to the function named in the second part of the message (hopefully this does not cause more confusion):
> 
> <...>
> ubi0: available PEBs: 235, total reserved PEBs: 269, PEBs reserved for bad PEB handling: 0
> ubi0: background thread "ubi_bgt0d" started, PID 45
> ################# [phy_device.c] phy_link_change - calling netif_carrier_on (eth2)
> ################# [sched_generic.c] netif_carrier_on - calling linkwatch_fire_event (eth2)
> ################# [phy_device.c] phy_link_change - calling adjust_link (eth2)
> ################# [ucc_geth.c] adjust_link - calling ugeth_quiesce (detaching device) (eth2)
> ################# [link_watch.c] linkwatch_do_dev - checking for netif_device_present(eth2) => 0
> IP-Config: Guessing netmask 255.255.255.0
> IP-Config: Complete:
>      device=eth2, hwaddr=00:e0:df:56:54:07, ipaddr=192.168.1.20, mask=255.255.255.0, gw=255.255.255.255
>      host=kmeter1, domain=, nis-domain=(none)
>      bootserver=192.168.1.100, rootserver=192.168.1.100, rootpath=
> ################# [ucc_geth.c] adjust_link - calling ugeth_activate (attaching device) (eth2)
> ucc_geth e0103200.ucc eth2: Link is Up - 100Mbps/Full - flow control off
> rpcbind: server 192.168.1.100 not responding, timed out
> rpcbind: server 192.168.1.100 not responding, timed out
> 
> As mentioned, just before linkwatch checks for the presence of the net_device, the device is detached by the ucc_geth driver; it is reattached only later.
> 

Detaching the netdev was introduced with 08b5e1c91ce9
("ucc_geth: Fix netdev watchdog triggering on link changes").
Most likely detaching the netdev isn't the best way to fix the original issue.
If it's just about switching the watchdog off temporarily, then maybe
calling dev_watchdog_down() is sufficient.
Relying on an asynchronous linkwatch event to activate a netdev that is
marked as not present is at least questionable.
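
To make the ordering from the log above concrete, here is a small userspace
model of it. This is only a sketch, not kernel code: the struct and all
model_* names are made up for illustration, and the real functions do far more
than toggle flags.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the relevant net_device state. */
struct model_netdev {
	bool present;    /* what netif_device_present() would report */
	bool carrier_ok; /* what netif_carrier_ok() would report */
	bool activated;  /* set once the device has been activated */
};

/*
 * Models the decision in linkwatch_do_dev(). With check_present set, it
 * behaves like kernels that contain commit 124eee3f6955: a device that is
 * not present is skipped.
 */
static void model_linkwatch_do_dev(struct model_netdev *dev, bool check_present)
{
	assert(dev != NULL);

	if (check_present && !dev->present)
		return; /* device detached: nothing is activated */

	if (dev->carrier_ok)
		dev->activated = true;
}

/* Replays the event order seen in the boot log for eth2. */
bool model_replay(bool check_present)
{
	struct model_netdev dev = { .present = true };

	dev.carrier_ok = true;  /* netif_carrier_on(), linkwatch event queued */
	dev.present = false;    /* adjust_link -> ugeth_quiesce (detach) */
	model_linkwatch_do_dev(&dev, check_present); /* linkwatch work runs */
	dev.present = true;     /* adjust_link -> ugeth_activate (attach) */

	return dev.activated;
}
```

In this model, model_replay(false) returns true (the device is activated, as
reported for v4.14.162), while model_replay(true) returns false (activation is
skipped because the check runs while the device is detached).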

>> The driver is quite old and maybe some parts need to be improved. The
>> referenced change is more than a year old and I'm not aware of any other
>> problem with it. So it seems the change isn't wrong.
> 
> I agree. I identified the commit by bisecting, which pointed me to where the problem could be.
> 
>>> This already happens with older versions (I checked v4.14.162);
>>> the situation there is the same, but without the additional check in
>>> the if condition the device is activated.
>>>
>>> I am currently working with kernel v5.4.8, but the behavior
>>> remains the same with the latest v5.5-rc4.
>>>
>>> Any idea how to solve this? Any help is appreciated.
>>>
>>> Regards,
>>> Matteo
>>>
>> Heiner
> 
> Matteo
>