Write Combining on PowerPC

Mon Dec 13 19:38:03 EST 2004

At 1:23 PM -0800 12/10/04, Kendall Bennett wrote:
>Hi Guys,
>
>We are working on some PowerPC machines and noticed that the boxes don't
>appear to support the equivalent of Write Combining that we get on x86
>boxes. Copies to Video Memory on our Motorola Sandpoint box run about
>10Mb/s, which is terribly, terribly slow! 
>
>Does anyone know if it is possible to do something similar to Write
>Combining for the PowerPC architecture, to speed up CPU access to the
>linear framebuffer? Part of the problem is that for video overlay support
>(not motion compensation) you have to dump the entire YUV frame into
>video memory for the hardware overlay, and even on a 1GHz PPC box playing
>an MPEG2 stream is not possible as X takes up over 80% of the CPU just to
>copy the YUV data to video memory!

1. As a previous poster mentioned, many PPCs have write combining, but they usually call it store gathering. I was just reading about it for the IBM 970FX.

2. What you need are cache line reads or writes through your bridge to the video memory.

3. If your frame buffer is marked non-cacheable, which is usually the case, see if you can set up a second aperture that is cached. Otherwise I don't think the store gathering will work. I don't know your board or processor, but you should experiment with cache modes to see which, if any, works best.

4. Assuming you can get a cacheable aperture, remember that when you write a complete image to frame buffer memory you waste 50% of your bandwidth reading cache lines from the frame buffer into your cache, only to overwrite them. You can use dcbz to clear a cache line without reading it first and then write it. This should double your bandwidth to 20 MB/sec.

5. How good is your copy loop? If you have floating point registers, you can often use these to increase your efficiency. There may be other ways to make the copy loop more efficient using processor-specific instructions that generate more efficient memory loads and stores. Try loop unrolling. Also make sure you prefetch the source using dcbt or a similar instruction. You have to experiment to see how far ahead of its use you need to prefetch the data.

6. Use small test programs to get it right.

7. You don't mention your processor type/speed, bus speeds and memory speed so it's pretty hard to tell what efficiency you might be able to achieve.

8. I make no comment about the efficiency of X. It's not what I would use for video applications, although I am sure there are those who have hacked it to work there.
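
A minimal sketch of the copy loop described in points 4 and 5 might look like the following. It is untested on any particular board and rests on several assumptions: a 32-byte cache line (the 970 uses 128), a destination that is mapped cacheable and is cache-line aligned, and GCC-style inline assembly. The names copy_lines, CACHE_LINE, PREFETCH_LINES and USE_PPC_CACHE_OPS are made up for illustration. The cache instructions are only emitted when USE_PPC_CACHE_OPS is defined on a PowerPC build; otherwise they compile to nothing, so the loop can be checked for correctness on any host first.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: 32-byte cache lines. Check your CPU manual; the 970 uses 128. */
#define CACHE_LINE 32u
/* Assumption: prefetch distance in lines. Tune this by experiment. */
#define PREFETCH_LINES 4u

#if defined(__powerpc__) && defined(USE_PPC_CACHE_OPS)
/* dcbz: create a zeroed line in the cache without reading it from memory.
   Only safe on cacheable memory, and only if CACHE_LINE matches the CPU. */
#define LINE_ZERO(p)  __asm__ volatile("dcbz 0,%0" : : "r"(p) : "memory")
/* dcbt: hint that the line containing p will be read soon. */
#define LINE_TOUCH(p) __asm__ volatile("dcbt 0,%0" : : "r"(p))
#else
/* Portable fallback: no cache hints, plain copy. */
#define LINE_ZERO(p)  ((void)(p))
#define LINE_TOUCH(p) ((void)(p))
#endif

/* Hypothetical helper: copy len bytes from src to a line-aligned dst,
   one cache line at a time. */
void copy_lines(void *dst, const void *src, size_t len)
{
    uint8_t *d = dst;
    const uint8_t *s = src;
    size_t lines = len / CACHE_LINE;

    /* dcbz clears the whole line containing the address, so an unaligned
       dst would wipe bytes outside the destination. */
    assert(((uintptr_t)dst % CACHE_LINE) == 0);

    for (size_t i = 0; i < lines; i++) {
        /* Prefetch a source line a few lines ahead, staying inside src. */
        if (i + PREFETCH_LINES < lines)
            LINE_TOUCH(s + PREFETCH_LINES * CACHE_LINE);

        /* Claim the destination line without a read from the frame buffer. */
        LINE_ZERO(d);

        /* Fixed-size copy of one full line; compilers typically expand
           this into straight-line loads and stores. */
        memcpy(d, s, CACHE_LINE);

        d += CACHE_LINE;
        s += CACHE_LINE;
    }

    /* Partial last line: copied without dcbz so nothing past len is cleared. */
    memcpy(d, s, len % CACHE_LINE);
}
```

Timing this from a small test program against a plain memcpy, with different values of PREFETCH_LINES and different cache modes on the aperture, is one way to do the experiments suggested in points 3, 5 and 6.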

Best,

leb
> 
>
>Obviously bus mastering will help solve this problem, but it would be
>better if there was a way to enable faster CPU access to the
>framebuffer as well.
>
>Regards,
>
>---
>Kendall Bennett
>Chief Executive Officer
>SciTech Software, Inc.
>Phone: (530) 894 8400
>http://www.scitechsoft.com
>
>~ SciTech SNAP - The future of device driver technology! ~
>
>
>_______________________________________________
>Linuxppc-embedded mailing list
>Linuxppc-embedded at ozlabs.org
>https://ozlabs.org/mailman/listinfo/linuxppc-embedded