[PATCH 1/2] Add character literal parsing in bytestrings

Thu Jul 21 15:19:03 EST 2011

On Wed, Jul 20, 2011 at 09:50:43AM -0700, Anton Staaf wrote:
> On Wed, Jul 20, 2011 at 6:40 AM, David Gibson
> <david at gibson.dropbear.id.au> wrote:
> > On Thu, Jun 23, 2011 at 04:20:38PM -0700, Anton Staaf wrote:
> >> This adds support for parsing simple (non-escaped) 'x' character
> >> literal syntax in bytestrings.  For example:
> >>
> >>     property = ['a' 2b 'c'];
> >>
> >> is equivalent to:
> >>
> >>     property = [61 2b 63];
> >
> > Hrm.  I like the idea of being able to encode character literals.
> > However I'm dubious as to whether the bytestring syntax is the right
> > place to encode them.
> >
> > Bytestrings are lexically strange; they are quite different from
> > the < ... > cell syntax: the things inside default to hex, and spacing
> > is irrelevant ([abcd] is equivalent to [ab cd], [a bc] is a syntax
> > error and *not* equivalent to [0a bc]).  This makes me worry about
> > possible ambiguities or other parsing problems if we put something
> > other than exactly 2 digit hex bytes in there - not that I can see any
> > definite ambiguities in this proposal.
> 
> As you point out below, the < ... > syntax doesn't permit byte values
> (a cell is 32 bits).  So using the cell list syntax would create a lot
> of wasted space.  Especially in my use where I need to create four 128
> byte tables for keyboard scan code mapping.  It would end up wasting
> >1KB.

I certainly wasn't suggesting using padding.  Apart from the wasted
space, it wouldn't let you use it for an already defined binding which
lacks the padding.
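
To put rough numbers on the space argument, here is a back-of-the-envelope
sketch in Python. The table count and table size are taken from Anton's
description above, and the 4-byte figure is the 32-bit cell size; the
variable names are made up for illustration.

```python
# Back-of-the-envelope size comparison; all names are illustrative only.
TABLES = 4        # four keyboard scan code tables (from the thread)
ENTRIES = 128     # 128 one-byte entries per table (from the thread)
CELL_BYTES = 4    # a cell is 32 bits

as_bytestring = TABLES * ENTRIES            # one byte per entry
as_cells = TABLES * ENTRIES * CELL_BYTES    # one padded cell per entry
wasted = as_cells - as_bytestring           # padding overhead in bytes

print(as_bytestring, as_cells, wasted)
```

Under those assumptions the padded encoding costs three extra bytes per
entry, which is where the ">1KB" estimate comes from.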

>  Adding cell size control syntax would certainly solve that
> problem.  Is this something you're interested in pursuing at this time?
> I'd be happy to help with that instead of continuing to push this.

Well, to be honest I'd have loved to have this syntax several years ago :).
The implementation should be almost trivial; really the only stumbling
block is finding a syntax which is unambiguous, won't cause parsing
oddities and obeys the principle of least surprise as best we can.

> Alternatively, I think it is clear that there are no problems parsing
> out the character literals.  Mainly because the ' character is unique
> and will never otherwise occur as a character in a byte literal
> declaration.  The presence or absence of whitespace should
> also not be a problem, since the character literal parsing is of a
> fixed length, thus there is no possibility for an ambiguous use such
> as ' ab '.  Also, the invalid use [a bc] is still invalid with
> character literals added, for example [a 'b'] or [a'b'] are both
> invalid because the existing bytestring regex only matches two hex
> characters in a row, and the new character literal regex only matches
> a single character bounded by single quotes.  So neither regex will
> match the lone a character and parsing will fail there.

That's true.  Consider me about 40% persuaded :).
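
The argument can be checked with a small sketch. The Python below is not
dtc's lexer; it is a hypothetical model of the two token patterns as they
are described in this thread (exactly two hex digits, or a single
non-escaped character between single quotes), and `parse_bytestring` is an
invented helper name.

```python
import re

# Hypothetical token patterns modeled on the description in this thread;
# they are not copied from dtc's actual lexer.
HEX_BYTE = re.compile(r"[0-9a-fA-F]{2}")
CHAR_LITERAL = re.compile(r"'([^\\'])'")   # simple, non-escaped literals only


def parse_bytestring(body):
    """Parse the text between [ and ] into a list of byte values."""
    out = []
    pos = 0
    while pos < len(body):
        if body[pos].isspace():
            pos += 1                        # whitespace between tokens is skipped
            continue
        m = HEX_BYTE.match(body, pos)
        if m:
            out.append(int(m.group(0), 16))
            pos = m.end()
            continue
        m = CHAR_LITERAL.match(body, pos)
        if m:
            out.append(ord(m.group(1)))
            pos = m.end()
            continue
        # Neither pattern matches, e.g. a lone hex digit.
        raise ValueError(f"syntax error at offset {pos}: {body[pos:]!r}")
    return out
```

In this model `parse_bytestring("'a' 2b 'c'")` and `parse_bytestring("61 2b 63")`
give the same bytes, `"abcd"` and `"ab cd"` are equivalent, and `"a bc"`,
`"a 'b'"` and `"a'b'"` all raise an error at the lone `a`.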

Ok, here's what I suggest.  For now, can you create a patch which
recognizes the character construct syntax in the lexer (including
escapes), and allows its use in cell context.  That won't actually do
what you want, but it gets a fair chunk of the code in a testable,
upstreamable form without making syntax changes I'm uncomfortable
with.
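
As a rough illustration of what that intermediate step means, here is a
Python sketch of turning a character literal token (with a few escapes)
into a cell. The set of escapes is an assumption for the example, not a
statement of what the eventual patch should accept, and
`char_literal_value` and `cell_bytes` are hypothetical names.

```python
# Hypothetical escape table; which escapes a real lexer accepts is an assumption.
SIMPLE_ESCAPES = {"n": 10, "t": 9, "r": 13, "0": 0, "\\": 92, "'": 39}
HEX_DIGITS = "0123456789abcdefABCDEF"


def char_literal_value(token):
    """Return the integer value of a token such as 'a', '\\n' or '\\x41'."""
    if len(token) < 3 or token[0] != "'" or token[-1] != "'":
        raise ValueError(f"not a character literal: {token!r}")
    inner = token[1:-1]
    # Plain character: anything except a bare backslash or quote.
    if len(inner) == 1 and inner not in "\\'":
        return ord(inner)
    # Single-character escape such as \n.
    if len(inner) == 2 and inner[0] == "\\" and inner[1] in SIMPLE_ESCAPES:
        return SIMPLE_ESCAPES[inner[1]]
    # Hex escape with one or two digits, such as \x41.
    if (inner.startswith("\\x") and 3 <= len(inner) <= 4
            and all(c in HEX_DIGITS for c in inner[2:])):
        return int(inner[2:], 16)
    raise ValueError(f"bad character literal: {token!r}")


def cell_bytes(value):
    """Encode a value as one big-endian 32-bit cell."""
    return value.to_bytes(4, "big")
```

With this sketch `cell_bytes(char_literal_value("'a'"))` is four bytes long,
three of them zero, which is why cell context alone doesn't solve the
byte table case.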

While we're getting that merged we can debate how to proceed: either
with variable size cell syntax or by allowing the character literals
in bytestring context.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson