Unsafe pte_update() in do_page_fault() (4xx and Book-E)

Fri Mar 3 07:26:34 EST 2006

Hi!

For the last couple of days I have been debugging rare

  swap_dup: Bad swap file entry 0x00000080

errors in my custom 2.4 kernel running on a 405GPr system.

My current theory is that this error is caused by the special lazy 
dcache/icache flush handling on 4xx and BookE. Because this code in my 
2.4 was actually a backport from 2.6, I think we have a problem in 
current 2.6 as well.

Here is what I think happens. On 4xx/BookE we use the execute bit to
defer the dcache-to-icache flush: in do_page_fault() we flush the page
when the execute trap triggers and set the _PAGE_HWEXEC bit in the PTE.

Unfortunately, we don't lock this PTE, and it's possible that after
the pte_present() check but _before_ the pte_update() call this
particular page is evicted from memory, e.g. because of extreme memory
pressure (of course, I'm assuming preemption is enabled).
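
To make the window concrete, here is a small userspace model of the
interleaving I have in mind. It is only a sketch, not kernel code: the
model_* names are made up for illustration, the reclaim step is
reduced to "clear the PTE", and the present-bit value 0x002 is my
assumption for 4xx (only the 0x200 value matters for the argument).

```c
#include <stdint.h>

/* HWEXEC value as on 405GPr; the PRESENT value is an assumption. */
#define MODEL_PAGE_PRESENT 0x002u
#define MODEL_PAGE_HWEXEC  0x200u

typedef uint32_t model_pte_t;

static int model_pte_present(model_pte_t pte)
{
    return (pte & MODEL_PAGE_PRESENT) != 0;
}

/* Stand-in for reclaim: the page goes away and the PTE is cleared. */
static void model_reclaim(model_pte_t *ptep)
{
    *ptep = 0;
}

/*
 * Unlocked sequence: check present, then set the exec bit.
 * If reclaim_in_window is non-zero, reclaim runs between the
 * check and the update, as it could after a preemption.
 */
static model_pte_t model_racy_fault(model_pte_t pte, int reclaim_in_window)
{
    if (!model_pte_present(pte))
        return pte;                 /* nothing to do */

    if (reclaim_in_window)
        model_reclaim(&pte);        /* the race window */

    pte |= MODEL_PAGE_HWEXEC;       /* stands in for pte_update() */
    return pte;
}
```

With reclaim in the window, the model leaves a PTE that has only the
exec bit set, which is neither empty nor present.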

If this happens, pte_update() sets the _PAGE_HWEXEC bit in the
just-cleared PTE. Some time later, another page fault happens for this
page, but because of that set bit the pte_none() test in
handle_pte_fault() fails, and we continue along the wrong path,
treating the PTE as a swap entry, and this triggers the swap_dup error
I mentioned at the beginning.
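
The wrong turn can be sketched like this. It is a simplified model of
the not-present branch as I read it in 2.4 (2.6 has additional cases);
the model_* names and the present-bit value are assumptions for
illustration only.

```c
#include <stdint.h>

/* HWEXEC value as on 405GPr; the PRESENT value is an assumption. */
#define MODEL_PAGE_PRESENT 0x002u
#define MODEL_PAGE_HWEXEC  0x200u

typedef uint32_t model_pte_t;

enum model_path {
    MODEL_PATH_NO_PAGE,   /* empty PTE: fault the page in fresh */
    MODEL_PATH_SWAP,      /* non-empty, not present: treat as swap entry */
    MODEL_PATH_PRESENT    /* present: handle protection/access bits */
};

/* Simplified decision made for a faulting PTE. */
static enum model_path model_classify(model_pte_t pte)
{
    if (!(pte & MODEL_PAGE_PRESENT)) {
        if (pte == 0)                 /* the pte_none() test */
            return MODEL_PATH_NO_PAGE;
        return MODEL_PATH_SWAP;       /* anything else looks swapped out */
    }
    return MODEL_PATH_PRESENT;
}
```

A cleared PTE carrying only the stray exec bit is classified as a swap
entry here, while a truly empty PTE takes the no-page path.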

_PAGE_HWEXEC is 0x200 on 405GPr, and because the swap entry is the PTE
shifted 2 bits to the right, we get that "0x00000080" value.
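
The arithmetic, spelled out as a tiny check. The helper name and the
MODEL_* macros are hypothetical; the 0x200 value and the 2-bit shift
are the ones stated above.

```c
#include <stdint.h>

#define MODEL_PAGE_HWEXEC 0x200u  /* exec bit on 405GPr */
#define MODEL_SWP_SHIFT   2       /* swap entry = PTE >> 2 */

/* Convert a raw PTE value into the swap entry value derived from it. */
static uint32_t model_pte_to_swp_entry(uint32_t pte)
{
    return pte >> MODEL_SWP_SHIFT;
}
```

Feeding it a PTE with only the exec bit set gives 0x200 >> 2 = 0x80,
which matches the entry in the error message.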

Paul, does my theory make any sense? I cannot test 2.6 on our hw. So 
far, after I added additional page_table_lock locking to my 2.4 in 
do_page_fault(), I haven't seen these errors, but it's too early to be 
100% sure :).

I'll make a patch for 2.6 if you think my analysis is correct.

-- 
Eugene