[2.4] [PATCH] hash_page rework, take 2

Olof Johansson olof at austin.ibm.com
Fri Feb 6 04:46:51 EST 2004

Julie DeWandel wrote:
> Hi Olof,
> Thank you for the explanations. In most cases, I agree but I still have
> one or two things I wanted to follow up on. I have added my comments to
> yours below.

See below.

> olof at austin.ibm.com wrote:

>> JSD: Better question for (1) is why are interrupts being disabled here?
>> JSD: Can this routine be called from interrupt context?
>> Without disabling interrupts, there's a risk for deadlocks if the
>> processor gets interrupted and the interrupt handler causes a page fault
>> that needs to be resolved Since the lock is held for writing, the handler
>> will wait forever when locking for reading. This is actually similar to
>> the original deadlock that this whole patch is meant to remove, but the
>> window is really small (just a few instructions) now. Likewise an
>> interrupt on a different processor is not a problem since forward
>> progress
>> is still guaranteed on the processor holding it for writing so the reader
>> will eventually get the lock.
> So it is true that an interrupt handler can cause a page fault? Can you
> provide me with an example?

Ben explained this pretty well yesterday. What we saw was mostly in
drivers loaded as modules when they happened to interrupt a processor
currently executing in vmalloc(). Stacks could look like:

c0000000c961f170  c00000000049bc30  __ex_table ÄkernelÜ 0x3c30
c0000000c961f200  c00000000000b544  .do_hash_page_DSI ÄkernelÜ 0x10
c0000000c961f4f0  c0000000c961f580  xfrm_policy_list_Rsmp_cdacf85e ÄÜ
c0000000c961f580  d000000000055f64  .tux_data_ready ÄtuxÜ 0x78
c0000000c961f620  c000000000294680  .tcp_data_queue ÄkernelÜ 0x7d0
c0000000c961f6e0  c000000000295b10  .tcp_rcv_established ÄkernelÜ 0x2b8
c0000000c961f7a0  c0000000002a1288  .tcp_v4_do_rcv ÄkernelÜ 0x1b0
c0000000c961f830  c0000000002a1a48  .tcp_v4_rcv ÄkernelÜ 0x7ac
c0000000c961f8d0  c00000000027b034  .ip_local_deliver_finish ÄkernelÜ 0x14c
c0000000c961f960  c00000000027aba4  .ip_local_deliver ÄkernelÜ 0x60
c0000000c961f9e0  c00000000027b4cc  .ip_rcv_finish ÄkernelÜ 0x34c
c0000000c961fa80  c00000000027ae04  .ip_rcv ÄkernelÜ 0x20c
c0000000c961fb20  c00000000025a26c  .netif_receive_skb ÄkernelÜ 0x240
c0000000c961fbc0  c00000000017bb7c  .e1000_clean_rx_irq ÄkernelÜ 0x394
c0000000c961fcd0  c00000000017b378  .e1000_clean ÄkernelÜ 0x7c
c0000000c961fd90  c00000000025a7a0  .net_rx_action ÄkernelÜ 0x15c
c0000000c961fe50  c000000000066c40  .do_softirq ÄkernelÜ 0x1b4
c0000000c961ff00  c000000000011fec  .do_IRQ ÄkernelÜ 0x164
c0000000c961ff90  c00000000000ae20  HardwareInterrupt_entry ÄkernelÜ 0x38
c00000055c3f6a80  c000000000179324  .e1000_xmit_frame ÄkernelÜ 0x1e8
c00000055c3f6d70  c00000000000aaa8  DataAccessSLB_common ÄkernelÜ 0x108
c00000055c3f6e30  c000000000090d2c  .__vmalloc ÄkernelÜ 0x1a4
c00000055c3f6f00  c0000000000d284c  .alloc_fd_array ÄkernelÜ 0x4c
c00000055c3f6f80  c0000000000d297c  .expand_fd_array ÄkernelÜ 0x9c
c00000055c3f7030  c00000000005a6f0  .copy_files ÄkernelÜ 0x214
c00000055c3f70f0  c00000000005adbc  .copy_process ÄkernelÜ 0x440
c00000055c3f71c0  c00000000005b838  .do_fork ÄkernelÜ 0x40
c00000055c3f7280  c000000000014100  .sys_clone ÄkernelÜ 0x9c
c00000055c3f7320  c000000000029944  .sys32_clone ÄkernelÜ 0x28
c00000055c3f7390  c00000000000fe48  ret_from_syscall_1
exception: c00 (System Call) regs c00000055c3f7400
                   c000000000017020  .arch_kernel_thread ÄkernelÜ 0x24
c00000055c3f76f0  c0000005599b9000  xfrm_policy_list_Rsmp_cdacf85e ÄÜ
c00000055c3f7780  d000000000073610  .start_external_cgi ÄtuxÜ 0x2c
c00000055c3f7800  d00000000007368c  .query_extcgi ÄtuxÜ 0x18
c00000055c3f7880  d0000000000602e0  .http_process_message ÄtuxÜ 0x2e4
c00000055c3f7910  d0000000000571fc  .tux_schedule_atom ÄtuxÜ 0x40
c00000055c3f7990  d0000000000587d8  .process_requests ÄtuxÜ 0x14c
c00000055c3f7a30  d0000000000672d8  .event_loop ÄtuxÜ 0x1a0
c00000055c3f7ad0  d00000000006977c  .__sys_tux ÄtuxÜ 0x3c8
c00000055c3f7bc0  c00000000024de0c  .sys_tux ÄkernelÜ 0x17c
c00000055c3f7c60  c00000000000fe48  ret_from_syscall_1

>> This is all an ad-hoc solution since there's no RCU in 2.4, so I needed
>> another light-weight syncronization method.
> Let me see if I understand this. When someone wants to free a page
> pointed to by an entry in a 3rd level page table, they clear out the pte
> in the page table using pte_clear(). Then they call pte_free with the
> address of the page they are freeing up (not really a page table entry
> but the actual page address). This page address is added to the batch
> list. Later, the idle loop or process termination code calls
> do_check_pgt_cache which will free all the pages in the batch list.
> However, prior to freeing the pages, pte_free_batch() will call
> pte_free_sync(). pte_free_sync tries to ensure that the pte_hash_locks
> aren't held on any processor. If a processor is holding it, it could be
> the case that the processor is currently walking the page tables and
> might have loaded, for example, the address of a pmd that we have since
> cleared. Since it is still using the data in that page, we don't want to
> free it until they are done with it. If we wait until they drop the
> lock, they are done and we also know any new reference will see the
> cleared value.
> Is this correct?


>> void
>> local_flush_tlb_mm(struct mm_struct *mm)
>> {
>> -    spin_lock(&mm->page_table_lock);
>> +    unsigned long flags;
>> +
>> +    spin_lock_irqsave(&mm->page_table_lock, flags);
>>     if ( mm->map_count ) {
>>         struct vm_area_struct *mp;
> JSD: I believe you said the _irqsave wasn't needed here so this hunk and
> the next JSD: one can be removed.

Grmbl. I'll make sure it's gone before I push a fix.

>> +/* Use the PTE functions for freeing PMD as well, since the same
>> + * problem with tree traversals apply. Since pmd pointers are always
>> + * virtual, no need for a page_address() translation.
>> + */
>> + +#define pte_free(pte_page)      __pte_free(pte_page)
> JSD: Your original patch defined pte_free to be
> __pte_free(page_address(pte_page))
> JSD: Is the page_address() wrapper no longer necessary?

This is a difference between the RedHat and ames/mainline trees due to
quicklists. In mainline/ames, pte's are just put on free lists, so the
translation isn't needed there. You still need it in your tree.


Olof Johansson                                        Office: 4F005/905
pSeries Linux Development                             IBM Systems Group
Email: olof at austin.ibm.com                          Phone: 512-838-9858
All opinions are my own and not those of IBM

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/

More information about the Linuxppc64-dev mailing list