Discussion:
[capnproto] Bikeshedding the capnp format
a***@gmail.com
2018-02-27 01:35:45 UTC
Hi,

(In case you think this discussion is inappropriate for this list, feel
free to reply to me privately.)

The capnp serialization format looks complicated and even ugly[1] when
you read the spec; at least that was certainly my impression the first
three or so times I read it. Then I set out to rethink it, and it
turns out that more or less every piece of it exists for some reason,
and some of the solutions are extremely neat. (For example, for 2^32-
byte segments, a signed displacement needs to be 33 bits long---and it
is! Or rather, it is 30 bits and measured in words. Not so in
FlatBuffers.)

Still, after thinking about it for some more time, I came up with a set
of simplifications of the current spec. I don't expect capnp or
anybody else will actually use them, but I think they might be
interesting anyway.

I'll present my logic first, rather than rewriting the spec, so that
it's clear where the decisions come from. I'm keeping most of the
general structure of the capnp encoding in place: storage is still
divided into 64-bit words, and pointers are still generally treated as
a pair of 32-bit values ("address" and "length") plus some flags.

(1a) Even if a pointer stores a relative offset (a "displacement"), we
still need to know which segment we're in to follow it in order to
avoid going out of bounds. We might as well just store the address
relative to the beginning of the segment (an "offset"). This means 29
bits for a word-aligned address, freeing up one bit in the address
field in addition to the two occupied by the type.

(1b) Which bit is the spare and which two bits are the type? I will
choose bits 0--1 (LSB) as the type and 2 as the spare. Then the
decoder does not need a barrel shifter to read the fields (or any
shifts at all, in fact). This might come in handy when decoding on an
eight-bit micro, although that's a bit fanciful.
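
To make (1b) concrete, here is a rough C++ sketch (the constants just
follow the bit assignment above; none of this is actual capnp code):

  #include <cstdint>

  // Type in bits 0--1, spare ("more") bit in bit 2, word offset in bits
  // 3--31: the interesting fields come out with masks alone.
  inline unsigned pointerType(uint64_t w) { return unsigned(w & 0x3); }
  inline bool     spareBit(uint64_t w)    { return (w & 0x4) != 0; }

  inline uint32_t offsetInBytes(uint64_t w) {
    // Because the word offset sits three bits up, masking off bits 0--2
    // of the low half yields the offset already multiplied by 8, i.e.
    // the byte offset, again without any shift.
    return uint32_t(w) & 0xFFFFFFF8u;
  }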

(2a) While the encoder _could_ choose the size of the data and pointer
sections of a struct separately, in reality it does not. (I'm
considering only the normal, zero-copy, arbitrary-initialization order
encoder, not the canonicalizing one.) If you know the schema, you can
compute how many pointers there are in a struct from its total
size. Thus only the total size in words needs to be stored; you can
then allow up to 2^32 bytes per struct and still have three bits free
in the length field [bits 0--2, for the same reason as in (1b)]. If
you can look the number of pointers up in a table, reading any field of
a known struct still takes constant time.
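
As a sketch of that lookup (everything below is hypothetical
decoder-side machinery that would be generated from the schema):

  #include <cstdint>
  #include <vector>

  // Per-struct-type table: maps the total size in words, as read from
  // the pointer, to the number of pointer fields that size implies.
  // With this, locating the pointer section of a known struct is O(1).
  struct StructLayoutTable {
    std::vector<uint16_t> pointersBySizeInWords;  // index = size in words

    uint16_t pointerCount(uint32_t sizeInWords) const {
      if (sizeInWords < pointersBySizeInWords.size())
        return pointersBySizeInWords[sizeInWords];
      // Struct written by a newer schema, see (2c): every pointer field
      // we know about is guaranteed to be present.
      return pointersBySizeInWords.empty()
          ? 0 : pointersBySizeInWords.back();
    }
  };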

(2b) What if you _don't_ know the schema at all? Well, in this case
the only thing you could be doing is copying the struct, and this means
reading it in full, in linear time. Assuming you know where the
pointer section starts, it is sufficient to set aside a bit in each
pointer to indicate if it is the last in the current struct; use the
spare address bit from (1a) for that. The only problem is if the
struct contains no pointers at all; encode that in the spare bits from
(2a). [Alternatively, a struct with no pointers could be encoded
without ambiguity as a list of bytes (4a), but I'm not sure that is a
good idea.]

(2c) There's a final, tricky case: what if you need to read an old
field out of a new struct using an old schema? (This is not explicitly
advertised in current capnp, but the format does make it
possible.) The struct is longer than possible with your current
schema, and the lookup table from (2a) is not sufficient. Well, OK; if
it's so long, you can certainly be sure it contains all the fields that
you do know about. (Barring unknown choices for unions, but these are
not the problem.) The problem is that you can't figure out where the
pointer section ends the way you did in (2b)---you need to do it in
constant time. You need to know, just from looking at the struct size,
where _both_ the data section and the pointer section start. The
solution is simple: encode one of them in reverse order, starting from
the end of the struct. I'll choose the pointer section here.
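
Concretely, a sketch of (2b) and (2c) together under this reversed
layout (the helpers are hypothetical, the bit constants follow (1b),
and I assume a zero M bit means "last pointer"):

  #include <cstddef>
  #include <cstdint>

  // Data word d of a struct sits at the front; pointer field i sits at
  // the back, counting down, so both are found from the size alone.
  inline const uint64_t* dataWord(const uint64_t* s, size_t d) {
    return s + d;
  }
  inline const uint64_t* pointerField(const uint64_t* s,
                                      size_t sizeInWords, size_t i) {
    return s + sizeInWords - 1 - i;
  }

  // Schema-less copier (2b): walk from the end of the struct downwards
  // until the M bit says "last pointer".  The caller must already know,
  // from the spare bits of (2a), that the struct has pointers at all.
  inline size_t countPointersWithoutSchema(const uint64_t* s,
                                           size_t sizeInWords) {
    size_t n = 0;
    for (size_t w = sizeInWords; w-- > 0; ) {
      ++n;
      if ((s[w] & 0x4) == 0) break;  // assumed encoding of "last pointer"
    }
    return n;
  }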

(3) What do we need to store in a far pointer? Well, we need at least
as much info as is contained in a regular pointer (one word---the
landing pad---including the object offset). We also need to store the
object segment, the landing pad offset and the landing pad
segment. The latter we can also fit in one word---the far pointer---if
(and only if) we restrict ourselves to 2^16 segments, storing both
segment numbers in the length field of the far pointer. This avoids
ugly double-far pointers.

(4a) If lists of structs can never be encoded as lists of scalars
unless each element encodes the first (scalar) field of the struct, the
element size of a list of scalars is always known from the schema, so
we only need to encode the length in bytes. And if we don't know the
schema, we could only be copying the list, so the length in bytes is
again sufficient. This means we can use all 32 bits of the length
field, allowing for 2^32 bytes in a list---the maximum possible given
the segment size. Lists of scalars then take up one pointer type (out
of four).
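
For example (a sketch; the scalar type T comes from the schema, the
wire only carries the byte length):

  #include <cstdint>
  #include <cstring>
  #include <optional>

  // (4a): reading element i of a list of known scalars needs only the
  // list's length in bytes for the bounds check.
  template <typename T>
  std::optional<T> readScalarElement(const uint8_t* listStart,
                                     uint32_t lengthBytes, uint32_t i) {
    if ((uint64_t(i) + 1) * sizeof(T) > lengthBytes)
      return std::nullopt;                       // out of range
    T value;
    std::memcpy(&value, listStart + uint64_t(i) * sizeof(T), sizeof(T));
    return value;
  }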

(4b) The only problem is lists of bits (well, also lists of Void, but I
don't see how these are useful if lists of structs cannot be encoded as
lists of scalars). If we restrict them to at most 2^32 elements as
well, we can encode them in a separate pointer type.

(4c) How do we encode lists of pointers? We have used up all four
available pointer types, but we can reuse the struct pointer type with
a different value in the spare bits from (2a). The length field then
stores the length of the list in words, which is equal to the number of
elements. [Alternatively, a list of pointers could be encoded without
ambiguity as a struct, but I'm not sure that's a good idea.]

(4d) How do we encode lists of structs? First, the tag word as it
currently exists is unavoidable, because we need to store the size of
each struct somewhere, and that takes at least a halfword. Storing the
number of elements in the tag address is nice, but not necessary,
because the length in words is enough for bounds checking. Second, a
list of structs must be at most 2^29 words long (and a fortiori at most
2^29 elements long) to fit in a segment. Again, we can use the
available encoding space in the struct pointer type. It is also useful
to include the tag word in the length in words, for easier copying.
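
As a sketch, locating element i of such a list (field positions follow
the tag-word layout given further down, which is my own guess at the
concrete bit assignment):

  #include <cstddef>
  #include <cstdint>

  // The tag word's length field gives the size of each element in
  // words; the elements follow the tag back-to-back.
  inline const uint64_t* structElement(const uint64_t* listStart,  // word 0 = tag
                                       uint32_t listLenWords,      // includes the tag
                                       size_t i) {
    uint64_t tag        = listStart[0];
    uint32_t perElement = uint32_t(tag >> 32) >> 3;  // 29-bit length in words
    uint64_t end        = 1 + (uint64_t(i) + 1) * perElement;
    if (perElement == 0 || end > listLenWords) return nullptr;  // out of bounds
    return listStart + 1 + i * perElement;
  }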

(5) What about capability references? The only unused encoding space
is in the type of structs and lists of structs, where we have used
three out of eight possible combinations of the bits from (2a).
Allocating a single combination there makes for a capability table of
2^29 entries. I'm not sure I like carving pieces out of the encoding
space like this, though. [This is not entirely true: we've also not
defined the semantics of a struct list with an exotic tag word, and of
the "more flag" (1a) in a pointer list. Not sure what could be the use
of that.]

Let me now describe the format that results from these observations,
without fixing the numerical constants. The pointer formats are:

+--+-+--------------------------+-------------------------------+
|Ty|M| | |
+--+-+--------------------------+-------------------------------+

Ty (2 bits) = type; one of
- "compound pointer"
- "far pointer"
- "list of bits"
- "list of bytes"
M (1 bit ) = more bit; one of
- "last pointer"
- "more pointers"

+--+-+--------------------------+-------------------------------+
|Ty|M| Offset | Length |
+--+-+--------------------------+-------------------------------+

Ty ( 2 bits) = "list of bits"
M ( 1 bit ) = more bit
Offset (29 bits) = offset of list, in words
Length (32 bits) = number of bits in list

+--+-+--------------------------+-------------------------------+
|Ty|M| Offset | Length |
+--+-+--------------------------+-------------------------------+

Ty ( 2 bits) = "list of bytes"
M ( 1 bit ) = more bit
Offset (29 bits) = offset of list, in words
Length (32 bits) = number of bytes in list

+--+-+--------------------------+---+---------------------------+
|Ty|M| Offset |Sub| Length |
+--+-+--------------------------+---+---------------------------+

Ty ( 2 bits) = "compound pointer"
M ( 1 bit) = more bit
Offset (29 bits) = offset of object, in words
Sub ( 3 bits) = subtype; one of
- "struct with no pointers"
- "struct with pointers"
- "list of pointers"
- "list of structs"
Length (29 bits) = length of object, in words

For Sub = "struct with no pointers", the object is contains data from
the struct.

For Sub = "struct with pointers", the object contains data from the
struct at the beginning and pointers from the struct in reverse order
at the end. The pointer at the lowest address has M = "last pointer",
other pointers have M = "more pointers". There must be at least one
pointer.

For Sub = "list of pointers", the object contains pointers. Each
pointer has M = (FIXME?).

For Sub = "list of structs", the object consists of a single tag word
followed by equal-length "structs with no pointers" or "structs with
pointers" encoded back-to-back. The format of the tag word must be:

+--+-+--------------------------+---+---------------------------+
|Ty|M| |Sub| Length |
+--+-+--------------------------+---+---------------------------+

Ty ( 2 bits) = "compound pointer"
M ( 1 bit) = "last pointer in struct"
Sub ( 3 bits) = subtype of each struct; one of
- "struct with no pointers"
- "struct with pointers"
Length (29 bits) = length of each struct, in words

(Or we might want to use the offset field for something.)
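
Putting the whole thing together, a rough decoder for one pointer word
(the concrete bit positions are my own choice, since the constants
above are deliberately left open; the subtype sits in bits 0--2 of the
length half, as per (2a)):

  #include <cstdint>

  enum Type : uint8_t { kCompound = 0, kFar = 1,
                        kListOfBits = 2, kListOfBytes = 3 };
  enum Sub  : uint8_t { kStructNoPtrs = 0, kStructWithPtrs = 1,
                        kListOfPtrs = 2, kListOfStructs = 3 };

  struct DecodedPointer {
    Type     type;
    bool     more;         // M bit: more pointers follow in this struct
    uint32_t offsetWords;  // 29-bit offset from the segment start
    Sub      sub;          // only meaningful for compound pointers
    uint32_t length;       // bits, bytes or words, depending on type
  };

  inline DecodedPointer decode(uint64_t w) {
    DecodedPointer p;
    p.type        = Type(w & 0x3);
    p.more        = (w & 0x4) != 0;
    p.offsetWords = (uint32_t(w) >> 3) & 0x1FFFFFFF;
    uint32_t hi   = uint32_t(w >> 32);
    if (p.type == kCompound) {
      p.sub    = Sub(hi & 0x7);
      p.length = hi >> 3;        // 29-bit length in words
    } else {
      p.sub    = kStructNoPtrs;  // unused for the other types
      p.length = hi;             // full 32-bit length (bits or bytes)
    }
    return p;
  }

(Far pointers would additionally split the length half into the two
16-bit segment fields from (3).)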

Final observation that doesn't really fit anywhere: I don't see why the
suggested framing format for byte streams doesn't encode the segment
list as a capnp list of 32-bit scalars, i.e. an appropriate 64-bit
pointer immediately followed by the segment lengths. (The offset in
the pointer is thus fixed.) The minor 4-byte redundancy can be used as
a magic number and to distinguish between little- and big-endian capnp
in case the latter ever comes to pass.
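
For what it's worth, a sketch of what that could look like on the wire
(little-endian, and every constant here is made up; the point is only
that the header word doubles as a valid list pointer whose offset field
is otherwise fixed):

  #include <cstdint>
  #include <vector>

  // Stream header: one 64-bit word whose low half (the fixed offset
  // field plus type bits) is reused as a magic/endianness marker and
  // whose high half carries the segment count, followed by the 32-bit
  // segment lengths.  The segment data itself would follow.
  inline std::vector<uint8_t> buildFrameHeader(
      const std::vector<uint32_t>& segmentLengths) {
    std::vector<uint8_t> out;
    auto put32 = [&out](uint32_t v) {
      for (int i = 0; i < 4; i++) out.push_back(uint8_t(v >> (8 * i)));
    };
    put32(0xC0DEC0DEu);                      // hypothetical magic value
    put32(uint32_t(segmentLengths.size()));  // list length = segment count
    for (uint32_t len : segmentLengths) put32(len);
    return out;
  }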

[1] That's the entire reason for <https://github.com/tailhook/probor/>,
for example, even though implementing CBOR deserialization is in fact
more difficult than capnp deserialization.

Cheers,
Alex
a***@gmail.com
2018-02-27 01:47:39 UTC
Post by a***@gmail.com
Let me now describe the format that results from these observations,
Forgot to include the far pointer format; it is, of course,

+--+-+--------------------------+---------------+---------------+
|Ty|M| Pad offset | Pad segment | Obj segment |
+--+-+--------------------------+---------------+---------------+

Ty ( 2 bits) = "far pointer"
M ( 1 bit ) = more bit
Pad offset (29 bits) = offset of pad, in words
Pad segment (16 bits) = segment of pad
Obj segment (16 bits) = segment of object

The destination object is referred to by the pointer located at (Pad
segment):(Pad offset) (the "landing pad"), with the offset inside the
pointer being interpreted relative to (Obj segment).

Alex
'Kenton Varda' via Cap'n Proto
2018-02-27 18:36:44 UTC
Fun stuff.

I, too, am not entirely happy with the Cap'n Proto wire format in
retrospect. Here's what I'd do if compatibility were not an issue:

1) Eliminate the concept of segments from the encoding. Segments need only
exist at write time, to allow for progressive allocation of additional
message space. But, tracking this could be handled entirely in the
MessageBuilder implementation. All the segments could be concatenated at
write time, and the receiver could then treat the message as one large
segment. Inter-segment pointers no longer need to be far pointers; as long
as the distance is under 4GB, a regular relative pointer is fine. (Builder
implementations would recognize cross-segment pointers by the fact that
they fail the bounds check.)
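
In other words (a reader-side sketch of this, not actual capnp code):
once the receiver sees one contiguous buffer, "which segment am I in"
disappears and pointer validation reduces to a single bounds check
against the whole message.

  #include <cstddef>
  #include <cstdint>

  struct MessageReader {
    const uint64_t* words;        // the whole message, concatenated
    size_t          sizeInWords;

    // ptrIndex: word index of the pointer itself; offsetWords: the
    // signed offset stored in it (the exact base convention does not
    // matter for the point being made).
    const uint64_t* follow(size_t ptrIndex, int64_t offsetWords,
                           size_t objectSizeWords) const {
      int64_t target = int64_t(ptrIndex) + offsetWords;
      if (target < 0 ||
          uint64_t(target) + objectSizeWords > sizeInWords)
        return nullptr;  // out of bounds: invalid on the read side; on
                         // the build side, the cue that the target
                         // lives in another allocation
      return words + target;
    }
  };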

2) All pointers -- including far pointers, if they exist -- should be
relative to the pointer location, never absolute. (Currently, far pointers
contain an absolute segment index.) Making all pointers relative means that
it's always possible to embed an existing message into a larger message
without doing a tree copy, which turns out to be a pretty nifty thing to be
able to do.

3) Recognize that there's no fundamental need to distinguish on the wire
whether a pointer points to a struct or a list. All we really need to know
is the object location, size, and which bits are data vs. pointers. Once we
recognize this, the pointer format can instead focus on optimizing the
common case, with fallbacks for less-common cases. The "common" pointer
encoding could be:

1 bit: 1 to indicate this encoding.
31 bits: offset
16 bits: element count
8 bits: data words per element (with special reserved values for 0-bit,
1-bit, 8-bit, 16-bit, 32-bit)
8 bits: pointers per element

This encoding would cover the vast majority of use cases -- including
struct lists without the need for a tag. Note that for a simple struct (not
a list), the element count would be 1. We then add a fallback encoding used
when any of these fields is not large enough. When the first bit is 0, this
indicates an "uncommon pointer", which could be any of:

- Null pointer (all-zero).
- Capability reference.
- Tag pointer: Encodes a 61-bit word offset pointing to a tagged object. A
tagged object starts with a tag word that encodes a 32-bit element count,
16-bit data words per element, and 16-bit pointers per element.
- Trampoline pointer: Like today's "double-far" pointer: points to a
two-word object which contains tag information and a further pointer to the
actual object content. Here we can have 2x16-bit segment sizes, 61-bit
offset, and 35-bit element count, which would become the new upper limit on
list sizes (compared to today's 29-bit limit).
- Other pointer types to be defined later.
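
A quick sketch of how the "common" case could be unpacked (the bit
positions are my own guess at a concrete layout; only the field widths
come from the description above):

  #include <cstdint>
  #include <optional>

  struct CommonPointer {
    uint32_t offset;          // 31 bits, relative, in words
    uint32_t elementCount;    // 16 bits; 1 for a plain struct
    uint8_t  dataPerElement;  // 8 bits; reserved codes for sub-word sizes
    uint8_t  ptrsPerElement;  // 8 bits
  };

  inline std::optional<CommonPointer> decodeCommon(uint64_t w) {
    if ((w & 1) == 0) return std::nullopt;  // "uncommon" pointer
    CommonPointer p;
    p.offset         = uint32_t(w >> 1) & 0x7FFFFFFFu;
    p.elementCount   = uint32_t(w >> 32) & 0xFFFFu;
    p.dataPerElement = uint8_t(w >> 48);
    p.ptrsPerElement = uint8_t(w >> 56);
    return p;
  }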

-Kenton
Post by a***@gmail.com
Post by a***@gmail.com
Let me now describe the format that results from these observations,
Forgot to include the far pointer format; it is, of course,
+--+-+--------------------------+---------------+---------------+
|Ty|M| Pad offset | Pad segment | Obj segment |
+--+-+--------------------------+---------------+---------------+
Ty ( 2 bits) = "far pointer"
M ( 1 bit ) = more bit
Pad offset (29 bits) = offset of pad, in words
Pad segment (16 bits) = segment of pad
Obj segment (16 bits) = segment of object
The destination object is referred to by the pointer located at (Pad
segment):(Pad offset) (the "landing pad"), with the offset inside the
pointer being interpreted relative to (Obj segment).
Alex
a***@gmail.com
2018-02-28 21:08:27 UTC
Post by 'Kenton Varda' via Cap'n Proto
Fun stuff.
Thanks, glad you liked it :)
Post by 'Kenton Varda' via Cap'n Proto
1) Eliminate the concept of segments from the encoding. Segments need
only exist at write time, to allow for progressive allocation of
additional message space. But, tracking this could be handled
entirely in the MessageBuilder implementation. All the segments could
be concatenated at write time, and the receiver could then treat the
message as one large segment. Inter-segment pointers no longer need
to be far pointers; as long as the distance is under 4GB, a regular
relative pointer is fine. (Builder implementations would recognize
cross-segment pointers by the fact that they fail the bounds check.)
Hm. I don't see how you could do this without a full-on O(n)
compaction pass on transmission/write-out. (Though you could argue
that writing out the message is O(n) in the first place, so you don't
lose anything.) Even with one, I'm not sure I know how to do it (and
still allow the message to grow arbitrarily in several places).

As far as I understand it, what you're describing is like allocating a
4 GB buffer for the message and reserving some space inside for each
place where the message can grow. Only you don't store the unused
bytes in the buffer and instead use a sort of virtual addressing to map
it to a collection of chunks (the "segments"); you don't transmit
them, either, and instead compact the final message on transmission.
The problem is, if the reserved space runs out, you're stuck, short of
moving the rest of the message out of the way.
Post by 'Kenton Varda' via Cap'n Proto
2) All pointers -- including far pointers, if they exist -- should be
relative to the pointer location, never absolute. (Currently, far
pointers contain an absolute segment index.) Making all pointers
relative means that it's always possible to embed an existing message
into a larger message without doing a tree copy, which turns out to
be a pretty nifty thing to be able to do.
Oh, I didn't think of that. Yes, that probably makes relative pointers
worth keeping. You still need to get the message data into the buffer,
and that's still O(n) [though a much faster O(n) than a tree traversal]
unless you use something like scatter/gather, which is not exactly a
standard feature of any of the current capnp libraries (or popular
operating systems for that matter).
Post by 'Kenton Varda' via Cap'n Proto
3) Recognize that there's no fundamental need to distinguish on the
wire whether a pointer points to a struct or a list. All we really
need to know is the object location, size, and which bits are data
vs. pointers.
I thought about that too when writing the original email (and mentioned
it in a parenthetical remark), but I wasn't quite sure not
distinguishing between an object and a list of pointers was safe wrt
schema evolution.
Post by 'Kenton Varda' via Cap'n Proto
[...]
16 bits: element count
8 bits: data words per element (with special reserved values for 0-bit, 1-bit, 8-bit, 16-bit, 32-bit)
8 bits: pointers per element
I know I ramble a lot, but if you can stand it, my original email
describes why you don't actually need a pointer count, provided you can
spare a bit in each pointer and a lookup table with 32 bits per struct
field in the decoder for every struct you'll want to read from. (Well,
OK, you need a "has pointers?" flag, but that's it.)

Also, if you store a byte count instead of an element count, you don't
need to distinguish between arrays of 8-, 16-, 32- and 64-bit scalars:
if you're reading, you can look it up in the schema. And if you're
only copying, you don't care what the element size is, only how many
bytes the whole thing takes.

The encoding spec mentions you thought about storing nested arrays---
did this turn out to be unnecessary?
Post by 'Kenton Varda' via Cap'n Proto
[...] When the first bit is 0, this indicates an "uncommon pointer",
[...]
- Trampoline pointer: Like today's "double-far" pointer: points to a
two-word object which contains tag information and a further pointer
to the actual object content. Here we can have 2x16-bit segment
sizes, 61-bit offset, and 35-bit element count, which would become
the new upper limit on list sizes (compared to today's 29-bit limit).
Do you actually use >4G arrays or even messages in practice? It seems
like they inevitably complicate the encoding for little benefit, though
I see how the FlatBuffer game devs would disagree.

Alex
'Kenton Varda' via Cap'n Proto
2018-03-01 02:25:36 UTC
Post by a***@gmail.com
Hm. I don't see how you could do this without a full-on O(n)
compaction pass on transmission/write-out. (Though you could argue
that writing out the message is O(n) in the first place, so you don't
lose anything.) Even with one, I'm not sure I know how to do it (and
still allow the message to grow arbitrarily in several places).
I don't understand. What I described shouldn't change write logistics.

Think of it this way: We're saying that far pointers can instead be
represented as regular pointers that happen to have an out-of-bounds
offset. Previously, such pointers were simply invalid. Now, we interpret
them as pointing into the next segment.

That said, given that segment boundaries no longer matter, it might make
sense to replace the segment table with a single 64-bit size -- or in the
case of files (where the filesystem tracks the size), no prefix is needed
at all.

Post by a***@gmail.com
Oh, I didn't think of that. Yes, that probably makes relative pointers
worth keeping. You still need to get the message data into the buffer,
and that's still O(n) [though a much faster O(n) than a tree traversal]
unless you use something like scatter/gather, which is not exactly a
standard feature of any of the current capnp libraries (or popular
operating systems for that matter).
Eh? kj::OutputStream and kj::AsyncOutputStream both support gather-writes,
and Cap'n Proto uses them when serializing to a stream. This translates into
a writev() syscall.
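
(For anyone who hasn't used it: a bare-bones POSIX gather-write looks
like the sketch below -- the segments go to the kernel as separate
buffers in one syscall, with no copy into a contiguous block first.
Error handling and short-write retries omitted.)

  #include <sys/uio.h>   // writev, struct iovec
  #include <cstddef>

  ssize_t writeSegments(int fd, const void* const* segments,
                        const size_t* sizes, int count) {
    iovec iov[16];                 // assume count <= 16 for the sketch
    for (int i = 0; i < count; i++) {
      iov[i].iov_base = const_cast<void*>(segments[i]);
      iov[i].iov_len  = sizes[i];
    }
    return writev(fd, iov, count);
  }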

Moreover, the C++ implementation of Cap'n Proto supports linking a large
external byte blob into a message without copying it. The blob becomes a
message segment.

I actively use this. Sandstorm.io's spk tool, which constructs a
capnp-encoded archive from a directory tree, actually mmap()s each file and
links it into the archive this way. It then writes out the whole archive at
once. All this happens without ever reading the bytes of any of the files
in userspace. In theory a really smart kernel and copy-on-write filesystem
has the information it needs to reflink the underlying bytes, though
admittedly that's unlikely to happen in practice.
Post by a***@gmail.com
Post by 'Kenton Varda' via Cap'n Proto
3) Recognize that there's no fundamental need to distinguish on the
wire whether a pointer points to a struct or a list. All we really
need to know is the object location, size, and which bits are data
vs. pointers.
I thought about that too when writing the original email (and mentioned
it in a parenthetical remark), but I wasn't quite sure not
distinguishing between an object and a list of pointers was safe wrt
schema evolution.
I think it only creates new compatibilities, not incompatibilities.
Specifically a struct field would be compatible with a list-of-struct field
so long as the list contained only one element. I'd probably take it a step
further and say that replacing a struct with a list-of-structs is a
supported upgrade, and that old programs will always use the first element
of the list. This is actually a fairly common change to want to make in real-world
protocols.
Post by a***@gmail.com
Post by 'Kenton Varda' via Cap'n Proto
[...]
16 bits: element count
8 bits: data words per element (with special reserved values for 0-bit,
1-bit, 8-bit, 16-bit, 32-bit)
Post by 'Kenton Varda' via Cap'n Proto
8 bits: pointers per element
I know I ramble a lot, but if you can stand it, my original email
describes why you don't actually need a pointer count, provided you can
spare a bit in each pointer and a lookup table with 32 bits per struct
field in the decoder for every struct you'll want to read from. (Well,
OK, you need a "has pointers?" flag, but that's it.)
I read that, but I don't like it. I'm uncomfortable with the idea that, to
determine the size of the sections, you either need a lookup table based on
the schema, or you need to do an O(n) scan over the pointers. I'll grant
you that you can make this work for the obvious use cases: copying is O(n)
anyway, and bounds-checking in accessor methods could be based on a check
against the total size rather than a check against the specific location
you're trying to read. But, the code to implement both these cases becomes
significantly more complicated.

I think other, obscure cases may be broken. For example, take the case
where you initialize a builder by deep-copying a message from a reader,
then you traverse that message in the builder. Say you follow some struct
pointer. How do you determine if the target object has a compatible layout?
It could be that the total size is a match, but the section sizes don't
match. You have two options:

1) Scan the pointers to verify there are the correct number. This would be
unacceptably slow -- repeatedly get()ing the same pointer should be fast,
but here we need to do an O(n) scan each time.
2) Just assume that if the struct has the right overall size, then it must
have been created with the same version of the schema and therefore has the
correct section sizes. In this case, though, you now have to account for
the possibility that you'll accidentally try to interpret a pointer as
data, or data as a pointer. You can thus accidentally leak data (by
overwriting a pointer, thus losing the connection to whatever it pointed
at) or encounter invalid pointers (something that's actually not possible
in builders currently).

So no, I don't think eliminating the separate section sizes is feasible.
But greatly reducing their range in the common case makes a lot of sense.
In fact, 4-bit counts would probably be fine in 99% of cases, so the
section sizes could be packed into a single byte if that were necessary.
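
Purely as an illustration of that last point, the common case could
carry both counts in one byte (nothing here is a proposal for actual
field positions):

  #include <cstdint>

  // 4 bits of data-word count plus 4 bits of pointer count cover
  // structs with up to 15 data words and 15 pointers; anything bigger
  // falls back to a wider encoding.
  inline uint8_t packSections(uint8_t dataWords, uint8_t pointerCount) {
    return uint8_t((dataWords & 0xF) | ((pointerCount & 0xF) << 4));
  }
  inline uint8_t dataWordsOf(uint8_t packed)    { return packed & 0xF; }
  inline uint8_t pointerCountOf(uint8_t packed) { return uint8_t(packed >> 4); }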
Post by a***@gmail.com
The encoding spec mentions you thought about storing nested arrays---
did this turn out to be unnecessary?
I don't recall ever having found a use for them.
Post by a***@gmail.com
[...] When the first bit is 0, this indicates an "uncommon pointer",
Post by 'Kenton Varda' via Cap'n Proto
[...]
- Trampoline pointer: Like today's "double-far" pointer: points to a
two-word object which contains tag information and a further pointer
to the actual object content. Here we can have 2x16-bit segment
sizes, 61-bit offset, and 35-bit element count, which would become
the new upper limit on list sizes (compared to today's 29-bit limit).
Do you actually use >4G arrays or even messages in practice? It seems
like they inevitably complicate the encoding for little benefit, though
I see how the FlatBuffer game devs would disagree.
I have used them occasionally, such as for the aforementioned Sandstorm
archive format. I do like the idea of designing the format such that it can
handle as large of a message as you could realistically want.

It would be nice, though, if e.g. embedded use cases could assume no
message is that large, and shed the code to handle it. In my proposal, tag
pointers and trampoline pointers could be left out, simplifying code
considerably.

-Kenton
a***@gmail.com
2018-03-01 03:35:00 UTC
Post by 'Kenton Varda' via Cap'n Proto
Post by a***@gmail.com
Hm. I don't see how you could do this without a full-on O(n)
compaction pass on transmission/write-out. (Though you could argue
that writing out the message is O(n) in the first place, so you don't
lose anything.) Even with one, I'm not sure I know how to do it (and
still allow the message to grow arbitrarily in several places).
I don't understand. What I described shouldn't change write logistics.
Think of it this way: We're saying that far pointers can instead be
represented as regular pointers that happen to have an out-of-bounds
offset. Previously, such pointers were simply invalid. Now, we
interpret them as pointing into the next segment.
The difference is that previously you could resize a segment without
adjusting any pointers. (Now, pointers into the next segment will
change their targets.) Conversely, if you allocated too big a buffer
for a segment, you could just ignore the extra part. (Now, again, you
have to write/transmit the padding, or pointers into the next segment
will become invalid.)
Post by 'Kenton Varda' via Cap'n Proto
Post by a***@gmail.com
Oh, I didn't think of that. Yes, that probably makes relative pointers
worth keeping. You still need to get the message data into the buffer,
and that's still O(n) [though a much faster O(n) than a tree traversal]
unless you use something like scatter/gather, which is not exactly a
standard feature of any of the current capnp libraries (or popular
operating systems for that matter).
Eh? kj::OutputStream and kj::AsyncOutputStream both support gather-
writes and Cap'n Proto uses it when serializing to a stream. This
translates into a writev() syscall.
Moreover, the C++ implementation of Cap'n Proto supports linking a
large external byte blob into a message without copying it. The blob
becomes a message segment.
Oh. Sorry. Should've studied the C++ API more carefully. I last did any
C++ before C++0x, so I was too lazy to actually work through the
somewhat unfamiliar style of the code and instead inferred the
interfaces from a skim of the docs and what the format itself permits.
Well, you see how that turned out.

(For some reason I was thinking about Linux-specific asynchronous
scatter/gather I/O, which _is_ a pain to use.)

Alex

PS. I'm still unclear as to what obscure case you were referring to
regarding section sizes; I'll try to understand and reply at a saner
time tomorrow.
'Kenton Varda' via Cap'n Proto
2018-03-01 20:12:32 UTC
Post by a***@gmail.com
The difference is that previously you could resize a segment without
adjusting any pointers. (Now, pointers into the next segment will
change their targets.) Conversely, if you allocated too big a buffer
for a segment, you could just ignore the extra part. (Now, again, you
have to write/transmit the padding, or pointers into the next segment
will get invalid.)
The usual usage pattern here is that you allocate space for a segment,
progressively fill it, and when you're out of space, then you allocate the
next segment. Some space might be left over at the end of an allocated
block if the objects didn't fit just right, but that space is not included
in the serialized message.

Under my proposal, this all works the same.

Admittedly, the current implementation is able to "go back" to previous
segments sometimes and fill in space that was left empty before, if the
objects happen to fit right. That's still possible under the new model too,
but harder: the extra space has to be treated as a new segment, which just
happens to be part of the same memory allocation. In most cases it probably
makes more sense to ignore the wasted space; if you allocate each segment
to be twice the size of the last, then the wasted memory will amortize away.
Post by a***@gmail.com
(For some reason I was thinking about Linux-specific asynchronous
scatter/gather I/O, which _is_ a pain to use.)
Eh, it's only a pain in that there's a limit on how many buffers you can
pass per syscall, but I've handled that in the library.

(Windows, OTOH, doesn't seem to support scatter/gather except with some
very finicky requirements that are unrealistic.)

-Kenton