[ccan] AVR, hashtable and general containers...

Sat Oct 23 13:18:12 EST 2010

On Fri, 22 Oct 2010 05:19:09 pm Joseph Adams wrote:
> Google searching "avl tree implementation" yields implementations
> cluttered with commentary, redundant functions for the sake of
> pedagogy (e.g. separate RotateLeft and RotateRight functions), and/or
> syntax that doesn't work with a C compiler.  My goal was to write a
> sorted container suitable for low-memory environments (like a TI-89
> calculator), and while I was at it, an AVL implementation that's
> simple and works.

That's my favorite feature of your avl module :)

> On Thu, Oct 21, 2010 at 8:17 PM, Rusty Russell <rusty at rustcorp.com.au> wrote:
> > One difference between our interfaces is that you chose separate keys
> > and values, and I chose to assume a single object which contains both.  In
> > my experience that's the norm; you have an object you want to put into the
> > hash.  If that's true, then in the other case you could always make a
> > key/value object.
> 
> It sort of depends on what the user needs.  The way some programming
> languages deal with this (such as C++ and Haskell) is by having two
> different libraries for each case.  For just storing keys, it's
> usually called a "set"; for storing keys and values, it's usually
> called a "map".

Pulling me out of my C-thinking rut :)  Let's use that nomenclature.

> I chose to implement the AVL module as a "map" because I think it's
> easier to use a map as a set than it is to use a set as a map.  For
> instance, a map from strings to values can be created with
> avl_new((AvlCompare) strcmp), as opposed to having to wrap the
> string/value pair in a struct and write a compare function as well.

This has made me think very hard.  Thank you.

Firstly, for avl, an extra pointer (as needed for a map) might not matter,
but for a hashtable it doubles the size.

The accepted wisdom in linked lists in C is to embed rather than dissociate
the list element.  This avoids allocations and (sometimes) maintains type
safety.  So let's apply a congruent argument here.

If our objects know they are in a container, then we should not only prefer
sets to maps, but the AVL struct should be placed into the object.  (Linear
hash has the advantage that there is no per-object struct).

The counter-argument is that if you want hashes you know where to find them,
and that this should be a higher-level module.  Since linear hash does so
well at the set model, avl should explicitly be aimed at the "I really want
a generic map".

> I originally wasn't even going to
> have AvlIter, instead having a function called avl_walk that takes a
> callback function.  However, I quickly came to find that using
> avl_walk was really ugly (though slightly less ugly if you use GNU C's
> nested functions), so I implemented a simple iterator system so I
> could write avl_foreach and avl_foreach_reverse macros.

That matches my experience, though I have a preference for open-coded
control structures rather than loops where possible, eg:

	for (iter = avl_iter_fwd(avl); iter.node; avl_next(avl, &iter))

This only works if names are short enough to make a one-liner; you can see
the tricks I used here.

> Perhaps modules named after algorithms (e.g. avl, btree) rather than
> purpose (e.g. hash) should be more concerned with presentation of the
> algorithm for people to learn from rather than features and uniformity
> (though they should still be well-tested!).

The simplicity of avl is definitely attractive.  I'm thinking now
that we should avoid gratuitous differences but not aim for identical APIs
at all.

On the question of higher-level interfaces, I started implementing a "poor
man's template" for avl and hashtable; basically a mega-macro
AVL_DEFINE(prefix, ktype, dtype, cmpfn) which declared typesafe wrappers
like so:

	struct avl_##prefix;
	static inline struct avl_##prefix *avl_##prefix##_new(void)
	{
		// Check type of cmpfn is correct...
		(void)sizeof((cmpfn)((const ktype *)NULL, (const ktype *)NULL));
		return (struct avl_##prefix *)avl_new(cmpfn);
	}
	...

If such a thing becomes the norm, I think it should be in a separate header
(perhaps not a separate module) because it's ugly, even if the usage is nicer.

> For a map or set meant to be used the same way C++ programmers use
> std::map and std::set, perhaps those CCAN modules should be called
> "map" and "set" and be filled with all the bells and whistles.  The
> ideal map/set would probably be a balanced binary tree based on
> red-black trees and would feature insertion/deletion-safe iterators
> like I described above.  Of course, just copying the AVL code and
> adding bells/whistles would be fine too, but I won't have time to do
> that for a while.

Do we really want to go there?  I think Judy trees would be optimal for
the map case (though benchmarking it vs. bitstealing hash is on my TODO),
though I don't want to take other's packaged and supported libraries and
drag them into CCAN.

Thanks,
Rusty.