Jumbo Frame bug in ibm_newemac driver (was Jumbo Frames, sil24 SATA driver, and kswapd0 page allocation failures)

Jonathan Haws Jonathan.Haws at sdl.usu.edu
Tue Oct 27 02:38:39 EST 2009


Okay, I need to revisit this issue.  My time has been taken up by other things for the past couple of months, but I am now back on this network problem.

Here is what I have done:

1. I modified the ibm_newemac driver to follow scatter-gather chains on the RX path.  The idea was to set up the driver so that it only ever deals with single pages.  The MAL in the PPC only supports data transfers of up to 4080 bytes (less than a single page), so the hardware should be able to handle single-page chains.  I set this up just like the e1000 driver does.  For whatever reason, this did not work, probably because I do not fully understand the Linux network stack yet (as becomes apparent in the next iteration).
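For reference, the full code from that attempt is in the quoted message at the bottom of this mail; condensed, the per-descriptor refill amounted to posting one freshly allocated page per slot, roughly:

	/* Condensed from the quoted code below: post one page per RX
	 * descriptor; the received data is later attached to the skb as
	 * a page fragment with skb_fill_page_desc(). */
	dev->page = alloc_page(GFP_ATOMIC);
	if (unlikely(!dev->page))
		return -ENOMEM;
	dev->rx_desc[slot].data_ptr = dma_map_page(&dev->ofdev->dev,
						   dev->page, 0, 4096,
						   DMA_FROM_DEVICE);
	dev->rx_desc[slot].data_len = 0;
	wmb();
	dev->rx_desc[slot].ctrl = MAL_RX_CTRL_EMPTY |
	    (slot == (NUM_RX_BUFF - 1) ? MAL_RX_CTRL_WRAP : 0);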

2. I reverted to the original driver and found that, contrary to what I had thought earlier, the driver does allocate a ring of skbs for its own use.  However, when a jumbo packet (larger than 4080 bytes) is received, the driver hands off the skb that was pre-allocated for that ring slot and allocates a new skb to replace it in the ring.  That replacement allocation in the receive path is where the problem is.  So, to remedy this, I pre-allocated the same number of jumbo skbs for the sole purpose of serving as replacement skbs for the RX ring.  Here is some code that shows the idea:

static int emac_open(struct net_device *ndev)
{
	...

	/* Allocate RX ring */
	for (i = 0; i < NUM_RX_BUFF; ++i) {
		if (emac_alloc_rx_skb(dev, i, GFP_KERNEL)) {
			printk(KERN_ERR "%s: failed to allocate RX ring\n",
			       ndev->name);
			goto oom;
		}
	}

	...
}

static inline int emac_alloc_rx_skb2(struct emac_instance *dev, int slot,
				     gfp_t flags)
{
	/* Grab the pre-allocated jumbo skb for this slot instead of
	 * calling alloc_skb() in the RX path. */
	struct sk_buff *skb = dev->rx_skb_pool[slot];

	if (unlikely(!skb))
		return -ENOMEM;

	if (skb_recycle_check(skb, emac_rx_skb_size(dev->rx_skb_size))) {
		dev->rx_skb[slot] = skb;
		dev->rx_desc[slot].data_len = 0;

		skb_reserve(skb, EMAC_RX_SKB_HEADROOM + 2);
		dev->rx_desc[slot].data_ptr =
		    dma_map_single(&dev->ofdev->dev, skb->data - 2,
				   dev->rx_sync_size, DMA_FROM_DEVICE) + 2;
		wmb();
		dev->rx_desc[slot].ctrl = MAL_RX_CTRL_EMPTY |
		    (slot == (NUM_RX_BUFF - 1) ? MAL_RX_CTRL_WRAP : 0);

		return 0;
	} else {
		printk(KERN_NOTICE "EMAC: skb not recyclable\n");
		return -ENOMEM;
	}
}

static int emac_poll_rx(void *param, int budget)
{
	...
      sg:
	if (ctrl & MAL_RX_CTRL_FIRST) {
		BUG_ON(dev->rx_sg_skb);
		if (unlikely(emac_alloc_rx_skb2(dev, slot, GFP_ATOMIC))) {
			DBG(dev, "rx OOM %d" NL, slot);
			++dev->estats.rx_dropped_oom;
			emac_recycle_rx_skb(dev, slot, 0);
		} else {
			dev->rx_sg_skb = skb;
			emac_recycle_rx_skb(dev, slot, len);
			skb_put(skb, len);
		}
	} else if (!emac_rx_sg_append(dev, slot) &&
		   (ctrl & MAL_RX_CTRL_LAST)) {

		skb = dev->rx_sg_skb;
		dev->rx_sg_skb = NULL;

		ctrl &= EMAC_BAD_RX_MASK;
		if (unlikely(ctrl && ctrl != EMAC_RX_TAH_BAD_CSUM)) {
			emac_parse_rx_error(dev, ctrl);
			++dev->estats.rx_dropped_error;
			dev_kfree_skb(skb);
			len = 0;
		} else {
			/* printk(KERN_NOTICE "EMAC: pushing sg packet\n"); */
			goto push_packet;
		}
	}
	goto skip;
	...
}

The changes are the allocation of rx_skb_pool in emac_open(), the call to emac_alloc_rx_skb2() in emac_poll_rx(), and the copy of emac_alloc_rx_skb() that was modified to create emac_alloc_rx_skb2().  There are also corresponding allocations for rx_skb_pool in emac_resize_rx_ring() for when the ring needs to be resized.
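The rx_skb_pool allocation itself is not shown above; as a rough sketch of the idea (assuming rx_skb_pool is a struct sk_buff *rx_skb_pool[NUM_RX_BUFF] array added to struct emac_instance), the loop in emac_open() is along these lines:

	/* Sketch only: pre-allocate one jumbo-sized skb per ring slot so
	 * the RX path never has to allocate a replacement with
	 * GFP_ATOMIC. */
	for (i = 0; i < NUM_RX_BUFF; ++i) {
		dev->rx_skb_pool[i] = alloc_skb(dev->rx_skb_size, GFP_KERNEL);
		if (!dev->rx_skb_pool[i])
			goto oom;
	}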

Now the problem I am having is this: the first time through the ring, things work just fine.  But the second time through the loop, the buffers have not been cleaned out - they still think they contain data.  I have tried calling skb_recycle_check() to restore each skb to a fresh state, but that call fails, apparently because the skb cannot be reused for receive.  Why is that the case?  What am I missing that would allow the skb to be reused?
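For reference, this is roughly what skb_recycle_check() in net/core/skbuff.c tests before it will reuse an skb (paraphrased from the source, so take the details as approximate):

	/* Rough paraphrase of skb_recycle_check(skb, size); see
	 * net/core/skbuff.c for the real checks. */
	if (skb_is_nonlinear(skb) || skb->fclone != SKB_FCLONE_UNAVAILABLE)
		return 0;	/* carries page frags or is an fclone */
	if (skb_end_pointer(skb) - skb->head <
	    SKB_DATA_ALIGN(size + NET_SKB_PAD))
		return 0;	/* linear buffer too small for reuse */
	if (skb_shared(skb) || skb_cloned(skb))
		return 0;	/* someone else still references the data */
	/* otherwise the skb state is reset and 1 is returned */

Presumably one of those conditions is true for my pooled skbs the second time around, but I do not see what I need to do to make them reusable.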

I will admit, I am not a Linux network driver expert, though I am learning.  If anyone can lend any advice or can see a problem in my logic, then please let me know.

Thanks!

Jonathan

> > If the hardware supports it, the best way to deal with it is to set up
> > the driver so that it only ever deals in single pages.
> 
> I am working on fixing the driver to support NETIF_F_SG and have
> changed how it receives packets to follow how the e1000 driver does
> it.
> 
> Here is where I am at:
> 
> When I get the first part of the frame, I allocate an skb for the
> packet.  I call dev->page = alloc_page(GFP_ATOMIC) to allocate a
> page for the 4080 bytes coming from the MAL.
> 
> I then set up a DMA mapping for that page to get the data out of the
> MAL (the original code simply used dma_map_single, but I need a
> page).
> 
> Once the DMA map has been setup and data transferred, I call
> skb_fill_page_desc() to put the data into the skb.  I then wrote a
> function called emac_consume_page, which unmaps the DMA mapping,
> frees the page, and updates the lengths in the skb.
> 
> The relevant source code is at the end of this email.
> 
> My problem is this:
> 
> When I run this code, it appears to create the fragmented packet
> just fine, but when it passes it up the stack, the kernel spits out
> these bugs, one after another:
> 
> BUG: Bad page state in process swapper  pfn:0ee9b
> page:c051f360 flags:(null) count:-3 mapcount:0 mapping:(null)
> index:766
> Call Trace:
> [c032bc30] [c0006ef0] show_stack+0x44/0x16c (unreliable)
> [c032bc70] [c006c438] bad_page+0x94/0x130
> [c032bc90] [c006d4a0] get_page_from_freelist+0x458/0x4d4
> [c032bd20] [c006d5f4] __alloc_pages_nodemask+0xd8/0x4f8
> [c032bda0] [c01a1174] emac_poll_rx+0x300/0x9c8
> [c032bdf0] [c019cb64] mal_poll+0xa8/0x1ec
> [c032be20] [c01cf218] net_rx_action+0x9c/0x1b4
> [c032be50] [c0039678] __do_softirq+0xc4/0x148
> [c032be90] [c0004d18] do_softirq+0x78/0x80
> [c032bea0] [c0039264] irq_exit+0x64/0x7c
> [c032beb0] [c0005210] do_IRQ+0x9c/0xb4
> [c032bed0] [c000fa7c] ret_from_except+0x0/0x18
> [c032bf90] [c000808c] cpu_idle+0xdc/0xec
> [c032bfb0] [c00028fc] rest_init+0x70/0x84
> [c032bfc0] [c02e0864] start_kernel+0x240/0x2c4
> [c032bff0] [c0002254] start_here+0x44/0xb0
> BUG: Bad page state in process swapper  pfn:0ee8c
> page:c051f180 flags:(null) count:-3 mapcount:0 mapping:(null)
> index:757
> Call Trace:
> [c032bc30] [c0006ef0] show_stack+0x44/0x16c (unreliable)
> [c032bc70] [c006c438] bad_page+0x94/0x130
> [c032bc90] [c006d4a0] get_page_from_freelist+0x458/0x4d4
> [c032bd20] [c006d5f4] __alloc_pages_nodemask+0xd8/0x4f8
> [c032bda0] [c01a1174] emac_poll_rx+0x300/0x9c8
> [c032bdf0] [c019cb64] mal_poll+0xa8/0x1ec
> [c032be20] [c01cf218] net_rx_action+0x9c/0x1b4
> [c032be50] [c0039678] __do_softirq+0xc4/0x148
> [c032be90] [c0004d18] do_softirq+0x78/0x80
> [c032bea0] [c0039264] irq_exit+0x64/0x7c
> [c032beb0] [c0005210] do_IRQ+0x9c/0xb4
> [c032bed0] [c000fa7c] ret_from_except+0x0/0x18
> [c032bf90] [c000808c] cpu_idle+0xdc/0xec
> [c032bfb0] [c00028fc] rest_init+0x70/0x84
> [c032bfc0] [c02e0864] start_kernel+0x240/0x2c4
> [c032bff0] [c0002254] start_here+0x44/0xb0
> 
> I know that I am missing something when it comes to allocating the
> pages for the fragments, but when I compare my methodology to the
> e1000 driver, they appear to be functionally the same?
> 
> Any ideas?  I can send the entire source file for the driver if
> needs be.
> 
> Thanks!
> 
> Jonathan
> 
> 
> Here is the source:
> 
> static int emac_poll_rx(void *param, int budget)
> {
> 
> ... /* Other code is here */
> 
> push_packet:
> 	skb->dev = dev->ndev;
> 	skb->protocol = eth_type_trans(skb, dev->ndev);
> 	emac_rx_csum(dev, skb, ctrl);
> 
> 	if (unlikely(netif_receive_skb(skb) == NET_RX_DROP))
> 		++dev->estats.rx_dropped_stack;
> next:
> 	++dev->stats.rx_packets;
> skip:
> 	dev->stats.rx_bytes += len;
> 	slot = (slot + 1) % NUM_RX_BUFF;
> 	--budget;
> 	++received;
> 	continue;
> sg:
> if (ctrl & MAL_RX_CTRL_FIRST) {
> 	BUG_ON(dev->rx_sg_skb);
> 	if (unlikely(emac_alloc_rx_skb2(dev, slot, GFP_ATOMIC))) {
> 		DBG(dev, "rx OOM %d (%d) (%d)" NL, slot,
> 		    dev->rx_skb_size, len);
> 		++dev->estats.rx_dropped_oom;
> 		emac_recycle_rx_skb(dev, slot, 0);
> 	} else {
> 		dev->rx_sg_skb = skb;
> 		skb_fill_page_desc(dev->rx_sg_skb, 0, dev->page, 0, len);
> 		emac_consume_page(dev, len, slot);
> 		dev->rx_sg_skb->len += ETH_HLEN;
> 	}
> } else if (!emac_rx_sg_append(dev, slot) &&
> 	   (ctrl & MAL_RX_CTRL_LAST)) {
> 	skb = dev->rx_sg_skb;
> 	dev->rx_sg_skb = NULL;
> 
> 	ctrl &= EMAC_BAD_RX_MASK;
> 	if (unlikely(ctrl && ctrl != EMAC_RX_TAH_BAD_CSUM)) {
> 		emac_parse_rx_error(dev, ctrl);
> 		++dev->estats.rx_dropped_error;
> 		dev_kfree_skb(skb);
> 		len = 0;
> 	} else
> 		goto push_packet;
> }
> 
> ... /* Other code is here */
> } /* end of emac_poll_rx */
> 
> static inline int emac_alloc_rx_skb2(struct emac_instance *dev, int slot,
> 				     gfp_t flags)
> {
> 	struct sk_buff *skb = alloc_skb(242, flags);
> 	if (unlikely(!skb))
> 		return -ENOMEM;
> 
> 
> 	dev->rx_skb[slot] = skb;
> 	dev->rx_desc[slot].data_len = 0;
> 
> 	dev->page = alloc_page(flags);
> 	DBG(dev, "emac_alloc_skb2: page %x" NL, dev->page);
> 	if(unlikely(!dev->page))
> 	{
> 		return -1;
> 	}
> 	dev->rx_desc[slot].data_ptr = dma_map_page(&dev->ofdev->dev,
> 						   dev->page, 0, 4096,
> 						   DMA_FROM_DEVICE);
> 
> 	wmb();
> 	dev->rx_desc[slot].ctrl = MAL_RX_CTRL_EMPTY |
> 	    (slot == (NUM_RX_BUFF - 1) ? MAL_RX_CTRL_WRAP : 0);
> 
> 	return 0;
> } /* end of emac_alloc_rx_skb2 */
> 
> static inline void emac_consume_page(struct emac_instance *dev,
> 				     int length, int slot)
> {
> 	dma_unmap_page(&dev->ofdev->dev, dev->rx_desc[slot].data_ptr,
> 		       4096, DMA_FROM_DEVICE);
> 	wmb();
> 	__free_page(dev->page);
> 	dev->page = NULL;
> 	dev->rx_sg_skb->len += length;
> 	dev->rx_sg_skb->data_len += length;
> 	dev->rx_sg_skb->truesize += length;
> }
> 
> static inline int emac_rx_sg_append(struct emac_instance *dev, int slot)
> {
> 	if (likely(dev->rx_sg_skb != NULL)) {
> 		int len = dev->rx_desc[slot].data_len;
> 		int tot_len = dev->rx_sg_skb->len + len;
> 
> 		if (unlikely(tot_len + 2 > dev->max_mtu)) {
> 			++dev->estats.rx_dropped_mtu;
> 			dev_kfree_skb(dev->rx_sg_skb);
> 			dev->rx_sg_skb = NULL;
> 		} else {
> 			dev->page = alloc_page(GFP_ATOMIC);
> 			if(unlikely(!dev->page))
> 			{
> 				return -ENOMEM;
> 			}
> 			dev->rx_desc[slot].data_ptr =
> 				dma_map_page(&dev->ofdev->dev, dev->page,
> 					     0, 4096, DMA_FROM_DEVICE);
> 			dev->rx_desc[slot].data_len = 0;
> 			wmb();
> 			dev->rx_desc[slot].ctrl = MAL_RX_CTRL_EMPTY |
> 			    (slot == (NUM_RX_BUFF - 1) ? MAL_RX_CTRL_WRAP : 0);
> 			skb_fill_page_desc(dev->rx_sg_skb,
> 					   skb_shinfo(dev->rx_sg_skb)->nr_frags,
> 					   dev->page, 0, len);
> 			emac_consume_page(dev, len, slot);
> 			return 0;
> 		}
> 	}
> 	emac_recycle_rx_skb(dev, slot, 0);
> 	return -1;
> } /* end of emac_rx_sg_append */

