[Cbe-oss-dev] [RFC 4/5] spufs: add kernel support for spu task

Sun Jun 24 23:30:50 EST 2007

On Tue, Jun 12, 2007 at 06:10:12PM +0200, Sebastian Siewior wrote:
> Utilisation of SPUs by the kernel, main implementation. 
> The idea behind this implemtation is that there are single jobs that are
> executed asynchronous on the SPU. The user queues jobs with
> enqueue_for_spu() and gets a callback once the job is completed. The
> function itself does not block, it returns -EINPROGRESS. The job will be
> queued in a linked list (protected by a spinlock, calls from softirq
> context are possible) and the kthread that handles the SPU will be woken
> up.
> The SPU thread takes the first element from the list, and calls the
> enqueue function supplied by the user. The user has now the chance to fill
> the ring buffer entry and set a callback for notification which will be
> called once the SPU code accomplished the task.
> User's enqueue function may put more than one work item on the SPU if it
> is necessary. However the KSPU code ensures only one free slot. If more
> slots are needed, kspu may not dequeue the request from the queue and will
> call again, once another slot is available (user's enqueue function has to
> notice this of course).
> 
> The local store of the SPU is divided into several of areas:
> 
> ------------------ 256kb
> | RB ENTRY       |
> ------------------ Ring buffer entries
> | RB ENTRY       |
> ------------------
> | .........      |
> ------------------
> | RB state       | (consumed + outstanding)
> ------------------
> |    STACK       | Stack growing downwards
> ------------------
> | .......        | stack space
> ------------------
> | data           | 
> ------------------
> | code           |
> ------------------
> | multiplexer    |
> ------------------ 0
> 
> The multiplexor is the replacement for crt0. The entrypoint of SPU side
> code is 0x0. The job of the muliplexor is to determine the address of the
> function that is requsted to proceed the job and call it. If there are no
> further jobs, spu_stop() is executed. The SPU writes something to the
> mailbox once a job is done.
> One Ring buffer entry (RB entry in draft) consits of the function
> identifier and user a defined struct. 
> 
> Signed-off-by: Sebastian Siewior <bigeasy at linux.vnet.ibm.com>
> Index: b/arch/powerpc/platforms/cell/Kconfig
> ===================================================================
> --- a/arch/powerpc/platforms/cell/Kconfig
> +++ b/arch/powerpc/platforms/cell/Kconfig
> @@ -50,6 +50,54 @@ config SPU_FS_64K_LS
>  	  uses 4K pages. This can improve performances of applications
>  	  using multiple SPEs by lowering the TLB pressure on them.
>  
> +config SPU_KERNEL_SUPPORT
> +	tristate "Support for utilisation of SPU by the kernel"
> +	default m

Please don't put in default statement in Kconfig file unless there's
a very good reason for it.

> +	depends on SPU_FS
> +	help
> +	  With this option enabled, the kernel may use a SPU and execute
> +	  code. Right now, only AES supports this.

I'd say scrapt that last sentence.

> +
> +config KSPU_SYNC_MODE
> +	bool "Synchronous kspu interface"
> +	depends on SPU_KERNEL_SUPPORT
> +	default n
> +	help
> +	  The synchronous SPU interface is only for debugging & developing
> +		purpose. The advantage over the asynchron one is that it executes
> +		only one function at a time and misaligned DMA transfers, do not
> +		trigger BUG() but will recover. However printk() or any other
> +		console output is not available. My recommendation is to develop
> +		it in user space and try in kernel with this interface if it has
> +		to work :)

This is just a tiny bit of code.  I'm not sure how important it is for
you debugging, but dependent on that I'd either include it
unconditionally or not all all and leave it to a separate debug-only
patch.

> +choice
> +	prompt "Slots in the ringbuffer"
> +	depends on SPU_KERNEL_SUPPORT
> +	default SPU_RINB_SLOTS_16
> +	help
> +	  This option allows you to determine the number of slots in the
> +	  ringbuffer. More slots means that more work may be queued on
> +	  the SPU before waiting for completion. The number of slots must
> +	  be power of two. Every slot takes approx 128 bytes of memory.

This should at least be a boot/module-load time parameter.

> @@ -0,0 +1,574 @@
> +/*
> + * Helper function for dealing with kspus
> + */

This should have some copyright/author notices.

> +
> +#include <asm/spu_priv1.h>
> +#include <asm/kspu/kspu.h>
> +#include <linux/kthread.h>
> +#include <linux/module.h>
> +#include <linux/init_task.h>
> +#include <linux/hardirq.h>
> +#include "spufs.h"
> +#include "kspu_util.h"
> +#include <asm/kspu/merged_code.h>

Please always include linux/*.h first, then asm/*.h then local
headers.

> +struct work_item* get_spu_queue_slot(struct kspu_context *kctx)

Kernel coding style would be

struct work_item *get_spu_queue_slot(struct kspu_context *kctx)

Also please make sure all exported function have a kerneldoc comment
describing them.

> +	ls += sizeof (struct kspu_ring_data);
> +	/* ls points now to the first queue slot */
> +	ls += kctx->spu_code->queue_entr_size * (outstanding & queue_mask);
> +
> +	printk("Return slot %d, at %p\n", (outstanding&queue_mask), ls);

We probably shouldn't have a printk in here, simiarly in all the other
function.  If they're important for your debugging please turn them
into pr_debug statements.

> +int enqueue_for_spu(struct kspu_context *kctx, struct kspu_work_item *work)
> +{
> +	spin_lock_bh(&kctx->queue_lock);
> +	list_add_tail(&work->list, &kctx->work_queue);
> +	spin_unlock_bh(&kctx->queue_lock);
> +	wake_up_all(&kctx->newitem_wq);
> +	return -EINPROGRESS;
> +}
> +EXPORT_SYMBOL_GPL(enqueue_for_spu);

No need for any return value here.

> +static int spufs_run_kernel_spu(void *priv)
> +{
> +	struct kspu_context *kctx = (struct kspu_context*) priv;
> +	struct spu_context *ctx = kctx->spu_ctx;
> +	int ret;
> +	u32 status;
> +	int npc = 0;
> +	int fastpath;
> +	DEFINE_WAIT(wait_for_stop);
> +	DEFINE_WAIT(wait_for_ibox);
> +	DEFINE_WAIT(wait_for_newitem);
> +
> +	spu_enable_spu(ctx);
> +	ctx->event_return = 0;
> +
> +	ret = spu_acquire_runnable(ctx, 0);
> +	if (ret) {
> +		mutex_unlock(&ctx->run_mutex);
> +		printk("could not obtain runable spu: %d\n",ret);
> +		BUG();

Please don't sprincle that many BUG() calls around when there
are error we could handle.

> +		if (unlikely(signal_pending(current))) {
> +			printk("kspu: received signal while waiting for condition...\n");
> +			BUG();
> +		}

This is always called from a kernel thread, right?  In that case signals
are disable and there's no need to check for them.

> +		 */
> +		status = ctx->ops->status_read(ctx);
> +		if ((ctx->spu && ctx->spu->class_0_pending) || status & SPU_STATUS_INVALID_INSTR) {
> +
> +			printk("kspu error, status_register: 0x%08x\n", status);
> +			printk("event return: 0x%08lx, spu's npc: 0x%08x\n", kctx->spu_ctx->event_return,
> +					kctx->spu_ctx->ops->npc_read(kctx->spu_ctx));
> +			BUG();
> +		}

Please make sure lines are not longer than 80 chars.

> +		 * XXX
> +		 * Leave the kernel thread only if:
> +		 * => requsted by kthread_stop()
> +		 * => spu stopped (empty work queue)
> +		 */

Generally a kernel thread should only exit if kthread_stop
was called.  A thread exiting by it's own creates a lot of
problems for the kthread infrastructure.

> +static struct kspu_context *kspu_ctx;
> +extern struct kspu_code single_spu_code;
> +
> +struct kspu_context *get_kspu_ctx(void)
> +{
> +	return kspu_ctx;
> +}
> +EXPORT_SYMBOL_GPL(get_kspu_ctx);

Should there really be only a single kspu context?  I'd rather allow
for multiple of these.

> +#ifndef KSPU_MERGED_CODE_H
> +#define KSPU_MERGED_CODE_H
> +#include <linux/autoconf.h>

autoconf.h is included automatically by the build system.