[PATCH] vfs: Fix potential circular locking through setxattr() and removexattr()
Christian Brauner
brauner at kernel.org
Tue Jul 23 21:11:51 AEST 2024
On Tue, Jul 23, 2024 at 12:45:33PM GMT, Jan Kara wrote:
> On Tue 23-07-24 09:59:54, David Howells wrote:
> > When using cachefiles, lockdep may emit something similar to the circular
> > locking dependency notice below. The problem appears to stem from the
> > following:
> >
> > (1) Cachefiles manipulates xattrs on the files in its cache when called
> > from ->writepages().
> >
> > (2) The setxattr() and removexattr() system call handlers get the name
> > (and value) from userspace after taking the sb_writers lock, putting
> > accesses of the vma->vm_lock and mm->mmap_lock inside of that.
> >
> > (3) The afs filesystem uses a per-inode lock to prevent multiple
> > revalidation RPCs and in writeback vs truncate to prevent parallel
> > operations from deadlocking against the server on one side and local
> > page locks on the other.
> >
> > Fix this by moving the getting of the name and value in {get,remove}xattr()
> > outside of the sb_writers lock. This also has the minor benefits that we
> > don't need to reget these in the event of a retry and we never try to take
> > the sb_writers lock in the event we can't pull the name and value into the
> > kernel.
>
> Well, it seems like you are trying to get rid of the dependency
> sb_writers->mmap_sem. But there are other places where this dependency is
Independent of this issue, I think that moving the retrieval of name and
value out of the lock is a good thing. The commit message might need to
get reworded of course.
> created, in particular write(2) path is a place where it would be very
> difficult to get rid of it (you take sb_writers, then do all the work
> preparing the write and then you copy user data into page cache which
> may require mmap_sem).
>
> But looking at the lockdep splat below:
>
> > ======================================================
> > WARNING: possible circular locking dependency detected
> > 6.10.0-build2+ #956 Not tainted
> > ------------------------------------------------------
> > fsstress/6050 is trying to acquire lock:
> > ffff888138fd82f0 (mapping.invalidate_lock#3){++++}-{3:3}, at: filemap_fault+0x26e/0x8b0
> >
> > but task is already holding lock:
> > ffff888113f26d18 (&vma->vm_lock->lock){++++}-{3:3}, at: lock_vma_under_rcu+0x165/0x250
> >
> > which lock already depends on the new lock.
> >
> > the existing dependency chain (in reverse order) is:
> >
> > -> #4 (&vma->vm_lock->lock){++++}-{3:3}:
> > __lock_acquire+0xaf0/0xd80
> > lock_acquire.part.0+0x103/0x280
> > down_write+0x3b/0x50
> > vma_start_write+0x6b/0xa0
> > vma_link+0xcc/0x140
> > insert_vm_struct+0xb7/0xf0
> > alloc_bprm+0x2c1/0x390
> > kernel_execve+0x65/0x1a0
> > call_usermodehelper_exec_async+0x14d/0x190
> > ret_from_fork+0x24/0x40
> > ret_from_fork_asm+0x1a/0x30
> >
> > -> #3 (&mm->mmap_lock){++++}-{3:3}:
> > __lock_acquire+0xaf0/0xd80
> > lock_acquire.part.0+0x103/0x280
> > __might_fault+0x7c/0xb0
> > strncpy_from_user+0x25/0x160
> > removexattr+0x7f/0x100
> > __do_sys_fremovexattr+0x7e/0xb0
> > do_syscall_64+0x9f/0x100
> > entry_SYSCALL_64_after_hwframe+0x76/0x7e
> >
> > -> #2 (sb_writers#14){.+.+}-{0:0}:
> > __lock_acquire+0xaf0/0xd80
> > lock_acquire.part.0+0x103/0x280
> > percpu_down_read+0x3c/0x90
> > vfs_iocb_iter_write+0xe9/0x1d0
> > __cachefiles_write+0x367/0x430
> > cachefiles_issue_write+0x299/0x2f0
> > netfs_advance_write+0x117/0x140
> > netfs_write_folio.isra.0+0x5ca/0x6e0
> > netfs_writepages+0x230/0x2f0
> > afs_writepages+0x4d/0x70
> > do_writepages+0x1e8/0x3e0
> > filemap_fdatawrite_wbc+0x84/0xa0
> > __filemap_fdatawrite_range+0xa8/0xf0
> > file_write_and_wait_range+0x59/0x90
> > afs_release+0x10f/0x270
> > __fput+0x25f/0x3d0
> > __do_sys_close+0x43/0x70
> > do_syscall_64+0x9f/0x100
> > entry_SYSCALL_64_after_hwframe+0x76/0x7e
>
> This is the problematic step - from quite deep in the locking chain holding
> invalidate_lock and having PG_Writeback set you suddently jump to very outer
> locking context grabbing sb_writers. Now AFAICT this is not a real deadlock
> problem because the locks are actually on different filesystems, just
> lockdep isn't able to see this. So I don't think you will get rid of these
> lockdep splats unless you somehow manage to convey to lockdep that there's
> the "upper" fs (AFS in this case) and the "lower" fs (the one behind
> cachefiles) and their locks are different.
>
> > -> #1 (&vnode->validate_lock){++++}-{3:3}:
> > __lock_acquire+0xaf0/0xd80
> > lock_acquire.part.0+0x103/0x280
> > down_read+0x95/0x200
> > afs_writepages+0x37/0x70
> > do_writepages+0x1e8/0x3e0
> > filemap_fdatawrite_wbc+0x84/0xa0
> > filemap_invalidate_inode+0x167/0x1e0
> > netfs_unbuffered_write_iter+0x1bd/0x2d0
> > vfs_write+0x22e/0x320
> > ksys_write+0xbc/0x130
> > do_syscall_64+0x9f/0x100
> > entry_SYSCALL_64_after_hwframe+0x76/0x7e
> >
> > -> #0 (mapping.invalidate_lock#3){++++}-{3:3}:
> > check_noncircular+0x119/0x160
> > check_prev_add+0x195/0x430
> > __lock_acquire+0xaf0/0xd80
> > lock_acquire.part.0+0x103/0x280
> > down_read+0x95/0x200
> > filemap_fault+0x26e/0x8b0
> > __do_fault+0x57/0xd0
> > do_pte_missing+0x23b/0x320
> > __handle_mm_fault+0x2d4/0x320
> > handle_mm_fault+0x14f/0x260
> > do_user_addr_fault+0x2a2/0x500
> > exc_page_fault+0x71/0x90
> > asm_exc_page_fault+0x22/0x30
>
> Honza
> --
> Jan Kara <jack at suse.com>
> SUSE Labs, CR
More information about the Linux-erofs
mailing list