[2.4] [PATCH] hash_page rework, take 2
Olof Johansson
olof at austin.ibm.com
Fri Feb 6 04:46:51 EST 2004
Julie DeWandel wrote:
> Hi Olof,
>
> Thank you for the explanations. In most cases, I agree but I still have
> one or two things I wanted to follow up on. I have added my comments to
> yours below.
See below.
> olof at austin.ibm.com wrote:
>> JSD: Better question for (1) is why are interrupts being disabled here?
>> JSD: Can this routine be called from interrupt context?
>>
>> Without disabling interrupts, there's a risk for deadlocks if the
>> processor gets interrupted and the interrupt handler causes a page fault
>> that needs to be resolved Since the lock is held for writing, the handler
>> will wait forever when locking for reading. This is actually similar to
>> the original deadlock that this whole patch is meant to remove, but the
>> window is really small (just a few instructions) now. Likewise an
>> interrupt on a different processor is not a problem since forward
>> progress
>> is still guaranteed on the processor holding it for writing so the reader
>> will eventually get the lock.
>>
> So it is true that an interrupt handler can cause a page fault? Can you
> provide me with an example?
Ben explained this pretty well yesterday. What we saw was mostly in
drivers loaded as modules when they happened to interrupt a processor
currently executing in vmalloc(). Stacks could look like:
c0000000c961f170 c00000000049bc30 __ex_table ÄkernelÜ 0x3c30
c0000000c961f200 c00000000000b544 .do_hash_page_DSI ÄkernelÜ 0x10
c0000000c961f4f0 c0000000c961f580 xfrm_policy_list_Rsmp_cdacf85e ÄÜ
0xc8d6dd28
c0000000c961f580 d000000000055f64 .tux_data_ready ÄtuxÜ 0x78
c0000000c961f620 c000000000294680 .tcp_data_queue ÄkernelÜ 0x7d0
c0000000c961f6e0 c000000000295b10 .tcp_rcv_established ÄkernelÜ 0x2b8
c0000000c961f7a0 c0000000002a1288 .tcp_v4_do_rcv ÄkernelÜ 0x1b0
c0000000c961f830 c0000000002a1a48 .tcp_v4_rcv ÄkernelÜ 0x7ac
c0000000c961f8d0 c00000000027b034 .ip_local_deliver_finish ÄkernelÜ 0x14c
c0000000c961f960 c00000000027aba4 .ip_local_deliver ÄkernelÜ 0x60
c0000000c961f9e0 c00000000027b4cc .ip_rcv_finish ÄkernelÜ 0x34c
c0000000c961fa80 c00000000027ae04 .ip_rcv ÄkernelÜ 0x20c
c0000000c961fb20 c00000000025a26c .netif_receive_skb ÄkernelÜ 0x240
c0000000c961fbc0 c00000000017bb7c .e1000_clean_rx_irq ÄkernelÜ 0x394
c0000000c961fcd0 c00000000017b378 .e1000_clean ÄkernelÜ 0x7c
c0000000c961fd90 c00000000025a7a0 .net_rx_action ÄkernelÜ 0x15c
c0000000c961fe50 c000000000066c40 .do_softirq ÄkernelÜ 0x1b4
c0000000c961ff00 c000000000011fec .do_IRQ ÄkernelÜ 0x164
c0000000c961ff90 c00000000000ae20 HardwareInterrupt_entry ÄkernelÜ 0x38
c00000055c3f6a80 c000000000179324 .e1000_xmit_frame ÄkernelÜ 0x1e8
c00000055c3f6d70 c00000000000aaa8 DataAccessSLB_common ÄkernelÜ 0x108
c00000055c3f6e30 c000000000090d2c .__vmalloc ÄkernelÜ 0x1a4
c00000055c3f6f00 c0000000000d284c .alloc_fd_array ÄkernelÜ 0x4c
c00000055c3f6f80 c0000000000d297c .expand_fd_array ÄkernelÜ 0x9c
c00000055c3f7030 c00000000005a6f0 .copy_files ÄkernelÜ 0x214
c00000055c3f70f0 c00000000005adbc .copy_process ÄkernelÜ 0x440
c00000055c3f71c0 c00000000005b838 .do_fork ÄkernelÜ 0x40
c00000055c3f7280 c000000000014100 .sys_clone ÄkernelÜ 0x9c
c00000055c3f7320 c000000000029944 .sys32_clone ÄkernelÜ 0x28
c00000055c3f7390 c00000000000fe48 ret_from_syscall_1
exception: c00 (System Call) regs c00000055c3f7400
c000000000017020 .arch_kernel_thread ÄkernelÜ 0x24
c00000055c3f76f0 c0000005599b9000 xfrm_policy_list_Rsmp_cdacf85e ÄÜ
0x591077a8
c00000055c3f7780 d000000000073610 .start_external_cgi ÄtuxÜ 0x2c
c00000055c3f7800 d00000000007368c .query_extcgi ÄtuxÜ 0x18
c00000055c3f7880 d0000000000602e0 .http_process_message ÄtuxÜ 0x2e4
c00000055c3f7910 d0000000000571fc .tux_schedule_atom ÄtuxÜ 0x40
c00000055c3f7990 d0000000000587d8 .process_requests ÄtuxÜ 0x14c
c00000055c3f7a30 d0000000000672d8 .event_loop ÄtuxÜ 0x1a0
c00000055c3f7ad0 d00000000006977c .__sys_tux ÄtuxÜ 0x3c8
c00000055c3f7bc0 c00000000024de0c .sys_tux ÄkernelÜ 0x17c
c00000055c3f7c60 c00000000000fe48 ret_from_syscall_1
>> This is all an ad-hoc solution since there's no RCU in 2.4, so I needed
>> another light-weight syncronization method.
>>
> Let me see if I understand this. When someone wants to free a page
> pointed to by an entry in a 3rd level page table, they clear out the pte
> in the page table using pte_clear(). Then they call pte_free with the
> address of the page they are freeing up (not really a page table entry
> but the actual page address). This page address is added to the batch
> list. Later, the idle loop or process termination code calls
> do_check_pgt_cache which will free all the pages in the batch list.
>
> However, prior to freeing the pages, pte_free_batch() will call
> pte_free_sync(). pte_free_sync tries to ensure that the pte_hash_locks
> aren't held on any processor. If a processor is holding it, it could be
> the case that the processor is currently walking the page tables and
> might have loaded, for example, the address of a pmd that we have since
> cleared. Since it is still using the data in that page, we don't want to
> free it until they are done with it. If we wait until they drop the
> lock, they are done and we also know any new reference will see the
> cleared value.
>
> Is this correct?
Yes.
>> void
>> local_flush_tlb_mm(struct mm_struct *mm)
>> {
>> - spin_lock(&mm->page_table_lock);
>> + unsigned long flags;
>> +
>> + spin_lock_irqsave(&mm->page_table_lock, flags);
>>
>> if ( mm->map_count ) {
>> struct vm_area_struct *mp;
>>
> JSD: I believe you said the _irqsave wasn't needed here so this hunk and
> the next JSD: one can be removed.
Grmbl. I'll make sure it's gone before I push a fix.
>> +/* Use the PTE functions for freeing PMD as well, since the same
>> + * problem with tree traversals apply. Since pmd pointers are always
>> + * virtual, no need for a page_address() translation.
>> + */
>> + +#define pte_free(pte_page) __pte_free(pte_page)
>>
> JSD: Your original patch defined pte_free to be
> __pte_free(page_address(pte_page))
> JSD: Is the page_address() wrapper no longer necessary?
This is a difference between the RedHat and ames/mainline trees due to
quicklists. In mainline/ames, pte's are just put on free lists, so the
translation isn't needed there. You still need it in your tree.
-Olof
--
Olof Johansson Office: 4F005/905
pSeries Linux Development IBM Systems Group
Email: olof at austin.ibm.com Phone: 512-838-9858
All opinions are my own and not those of IBM
** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/
More information about the Linuxppc64-dev
mailing list