[RFC PATCH 1/2] libnvdimm: Add prctl control for disabling synchronous fault support.

Mon Jun 8 17:42:58 AEST 2020

"Aneesh Kumar K.V" <aneesh.kumar at linux.ibm.com> writes:

> On 6/3/20 1:56 PM, Jan Kara wrote:
>> On Tue 02-06-20 17:59:08, Williams, Dan J wrote:
>>> [ forgive formatting, a series of unfortunate events has me using Outlook for the moment ]
>>>
>>>> From: Jan Kara <jack at suse.cz>
>>>>>>> These flags are device properties that affect the kernel and
>>>>>>> userspace's handling of persistence.
>>>>>>>
>>>>>>
>>>>>> That will not handle the scenario with multiple applications using
>>>>>> the same fsdax mount point where one is updated to use the new
>>>>>> instruction and the other is not.
>>>>>
>>>>> Right, it needs to be a global setting / flag day to switch from one
>>>>> regime to another. Per-process control is a recipe for disaster.
>>>>
>>>> First I'd like to mention that hopefully the concern is mostly theoretical since
>>>> as Aneesh wrote above, real persistent memory never shipped for PPC and
>>>> so there are very few apps (if any) using the old way to ensure cache
>>>> flushing.
>>>>
>>>> But I'd like to understand why do you think per-process control is a recipe for
>>>> disaster? Because from my POV the sysfs interface you propose is actually
>>>> difficult to use in practice. As a distributor, you have hard time picking the
>>>> default because you have a choice between picking safe option which is
>>>> going to confuse users because of failing MAP_SYNC and unsafe option
>>>> where everyone will be happy until someone looses data because of some
>>>> ancient application using wrong instructions to persist data. Poor experience
>>>> for users in either way. And when distro defaults to "safe option", then the
>>>> burden is on the sysadmin to toggle the switch but how is he supposed to
>>>> decide when that is safe? First he has to understand what the problem
>>>> actually is, then he has to audit all the applications using pmem whether they
>>>> use the new instruction - which is IMO a lot of effort if you have a couple of
>>>> applications and practically infeasible if you have more of them.
>>>> So IMO the burden should be *on the application* to declare that it is aware
>>>> of the new instructions to flush pmem on the platform and only to such
>>>> application the kernel should give the trust to use MAP_SYNC mappings.
>>>
>>> The "disaster" in my mind is this need to globally change the ABI for
>>> persistence semantics for all of Linux because one CPU wants a do over.
>>> What does a generic "MAP_SYNC_ENABLE" knob even mean to the existing
>>> deployed base of persistent memory applications? Yes, sysfs is awkward,
>>> but it's trying to provide some relief without imposing unexplainable
>>> semantics on everyone else. I think a comprehensive (overengineered)
>>> solution would involve not introducing another "I know what I'm doing"
>>> flag to the interface, but maybe requiring applications to call a pmem
>>> sync API in something like a vsyscall. Or, also overengineered, some
>>> binary translation / interpretation to actively detect and kill
>>> applications that deploy the old instructions. Something horrid like on
>>> first write fault to a MAP_SYNC try to look ahead in the binary for the
>>> correct sync sequence and kill the application otherwise. That would at
>>> least provide some enforcement and safety without requiring other
>>> architectures to consider what MAP_SYNC_ENABLE means to them.
>> 
>> Thanks for explanation. So I absolutely agree that other architectures (and
>> even older versions of POWER architecture) must not be influenced by the new
>> tunable. That's why I wrote in my reply to Aneesh that I'd be for checking
>> during mmap(2) with MAP_SYNC, whether we are in a situation where new PPC
>> flush instructions are required and *only in that case* decide based on the
>> prctl value whether MAP_SYNC should be allowed or not.
>> 
>
> v2 version of the patch series does that
>
> https://lore.kernel.org/linuxppc-dev/20200602074909.36738-1-aneesh.kumar@linux.ibm.com/
>
>> Whether this solution is overengineering or not depends on how you think
>> it's likely there will be applications trying to use old flush instructions
>> with MAP_SYNC on POWER10 platforms...
>> 
>
> Now considering that with ppc64 we never had a real persistent memory 
> device available for the end-user to try and the new instructions are 
> only needed on newer hardware, can we assume we have enough time to get 
> the userspace to use new instructions?
>
> As a safety net, we can keep the dax device-specific sysfs control. But 
> in reality, by the time newer hardware gets released, we can get the 
> distributions updated to flip the CONFIG_ARCH_MAP_SYNC_DISABLE=n?
>
> With this:
> 1) vPMEM continues to work and since it is a volatile region. That 
> doesn't need any flush instructions.
>
> 2) We get pmdk and other user applications updated to use new 
> instructions and make sure updated packages are made available to all 
> distributions
>
> 3) On newer hardware, the device will appear with a new compat string. 
> Hence older distributions won't initialize pmem on newer hardware.
>
> 4) If we have a newer kernel with an older distro, we use the per 
> namespace sysfs knob that prevents the usage of MAP_SYNC.
>
> 5) After a year or so we mark the CONFIG_ARCH_MAP_SYNC_DISABLE=n
> on ppc64 when we are confident that everybody is using the new flush 
> instruction.
>

Is this approach ok for distributions? If so I can repost the series
dropping the prctl change.

-aneesh