a***@gmail.com
2018-02-27 01:35:45 UTC
Hi,
(In case you think this discussion is inappropriate for this list, feel
free to reply to me privately.)
The capnp serialization format looks complicated and even ugly[1] when
you read the spec; at least that was certainly my impression the first
three or so times I read it. Then I set out to rethink it, and it
turns out that more or less every piece of it exist for some reason,
and some of the solutions are extremely neat. (For example, for 2^32-
byte long segments, a signed displacement needs to be 33 bits long---
and it is! Or rather it is 30 bits and measured in words. Not so in
FlatBuffers.)
Still, after thinking about it for some more time, I came up with a set
of simplifications of the current spec. I don't expect capnp or
anybody else will actually use them, but I think they might be
interesting anyway.
I'll present my logic first, rather than rewriting the spec, so that
it's clear where the decisions come from. I'm keeping most of the
general structure of the capnp encoding in place: storage is still
divided into 64-bit words, and pointers are still generally treated as
a pair 32-bit values ("address" and "length") plus some flags.
(1a) Even if a pointer stores a relative offset (a "displacement"), we
still need to know which segment we're in to follow it in order to
avoid going out of bounds. We might as well just store the address
relative to the beginning of the segment (an "offset"). This means 29
bits for a word-aligned address, freeing up one bit in the address
field in addition to the two occupied by the type.
(1b) Which bit is the spare and which two bits are the type? I will
choose bits 0--1 (LSB) as the type and 2 as the spare. Then the
decoder does not need a barrel shifter to read the fields (or any
shifts at all, in fact). This might come in handy when decoding on an
eight-bit micro, although that's a bit fanciful.
(2a) While the encoder _could_ choose the size of the data and pointer
sections of a struct separately, in reality it does not. (I'm
considering only the normal, zero-copy, arbitrary-initialization order
encoder, not the canonicalizing one.) If you know the schema, you can
compute how many pointers there are in a struct from its total
size. Thus only the total size in words needs to be stored; you can
then allow up to 2^32 bytes per struct and still have three bits free
in the length field [bits 0--2, for the same reason as in (1b)]. If
you can look the number of pointers up in a table, reading any field of
a known struct still takes constant time.
(2b) What if you _don't_ know the schema at all? Well, in this case
the only thing you could be doing is copying the struct, and this means
reading it in full, in linear time. Assuming you know where the
pointer section starts, it is sufficient set aside a bit in each
pointer to indicate if it is the last in the current struct; use the
spare address bit from (1a) for that. The only problem is if the
struct contains no pointers at all; encode that in the spare bits from
(2a). [Alternatively, a struct with no pointers could be encoded
without ambiguity as a list of bytes (4a), but I'm not sure that is a
good idea.]
(2c) There's a final, tricky case: what if you need to read an old
field out of a new struct using an old schema? (This is not explicitly
advertised in current capnp, but the format does make it
possible.) The struct is longer than possible with your current
schema, and the lookup table from (2a) is not sufficient. Well, OK; if
it's so long, you can certainly be sure it contains all the fields that
you do know about. (Barring unknown choices for unions, but these are
not the problem.) The problem is that you can't figure out where the
pointer section ends the way you did in (2b)---you need to do it in
constant time. You need to know, just from looking at the struct size,
where _both_ the data section and the pointer section start. The
solution is simple: encode one of them in reverse order, starting from
the end of the struct. I'll choose the pointer section here.
(3) What do we need to store in a far pointer? Well, we need at least
as much info as is contained in a regular pointer (one word---the
landing pad---including the object offset). We also need to store the
object segment, the landing pad offset and the landing pad
segment. The latter we can also fit in one word---the far pointer---if
(and only if) we restrict ourselves to 2^16 segments, storing both
segment numbers in the length field of the far pointer. This avoids
ugly double-far pointers.
(4a) If lists of structs can never be encoded as lists of scalars
unless each element encodes the first (scalar) field of the struct, the
element size of a list of scalars is always known from the schema, so
we only need to encode the length in bytes. And if we don't know the
schema, we could only be copying the list, so the length in bytes is
again sufficient. This means we can use all 32 bits of the length
field, allowing for 2^32 bytes in a list---the maximum possible given
the segment size. Lists of scalars then take up one pointer type (out
of four).
(4b) The only problem is lists of bits (well, also lists of Void, but I
don't see how these are useful if lists of structs cannot be encoded as
lists of scalars). If we restrict them to at most 2^32 elements as
well, we can encode them in a separate pointer type.
(4c) How do we encode lists of pointers? We have used up all four
available pointer types, but we use the type of structs with a
different value in the spare bits from (2a). The length field then
stores the length of the list in words, which is equal to the number of
elements. [Alternatively, a list of pointers could be encoded without
ambiguity as a struct, but I'm not sure that's a good idea.]
(4c) How do we encode lists of structs? First, the tag word as it
currently exists is unavoidable, because we need to store the size of
each struct somewhere, and that takes at least a halfword. Storing the
number of elements in the tag address is nice, but not necessary,
because the length in words is enough for bounds checking. Second, a
list of structs must be at most 2^29 words long (and a fortiori at most
2^29 elements long) to fit in a segment. Again, we can use the
available encoding space in the struct pointer type. It is also useful
to include the tag word in the lenghth in words, for easier copying.
(5) What about capability references? The only unused encoding space
is in the type of structs and lists of structs, where we have used
three out of eight possible combinations of the bits from (2a).
Allocating a single combination there makes for a capability table of
2^29 entries. I'm not sure I like carving pieces out of the encoding
space like this, though. [This is not entirely true: we've also not
defined the semantics of a struct list with an exotic tag word, and of
the "more flag" (1a) in a pointer list. Not sure what could be the use
of that.]
Let me now describe the format that results from these observations,
without fixing the numerical constants. The pointer formats are:
+--+-+--------------------------+-------------------------------+
|Ty|M| | |
+--+-+--------------------------+-------------------------------+
Ty (2 bits) = type; one of
- "compound pointer"
- "far pointer"
- "list of bits"
- "list of bytes"
M (1 bit ) = more bit; one of
- "last pointer"
- "more pointers"
+--+-+--------------------------+-------------------------------+
|Ty|M| Offset | Length |
+--+-+--------------------------+-------------------------------+
Ty ( 2 bits) = "list of bits"
M ( 1 bit ) = more bit
Offset (29 bits) = offset of list, in words
Length (32 bits) = number of bits in list
+--+-+--------------------------+-------------------------------+
|Ty|M| Offset | Length |
+--+-+--------------------------+-------------------------------+
Ty ( 2 bits) = "list of bytes"
M ( 1 bit ) = more bit
Offset (29 bits) = offset of list, in words
Length (32 bits) = number of bytes in list
+--+-+--------------------------+---+---------------------------+
|Ty|M| Offset |Sub| Length |
+--+-+--------------------------+---+---------------------------+
Ty ( 2 bits) = "compound pointer"
M ( 1 bit) = more bit
Offset (29 bits) = offset of object, in words
Sub ( 3 bits) = subtype; one of
- "struct with no pointers"
- "struct with pointers"
- "list of pointers"
- "list of structs"
Length (29 bits) = length of object, in words
For Sub = "struct with no pointers", the object is contains data from
the struct.
For Sub = "struct with pointers", the object contains data from the
struct at the beginning and pointers from the struct in reverse order
at the end. The pointer at the lowest address has M = "last pointer",
other pointers have M = "more pointers". There must be at least one
pointer.
For Sub = "list of pointers", the object contains pointers. Each
pointer has M = (FIXME?).
For Sub = "list of structs", the object consists of a single tag word
followed by equal-length "structs with no pointers" or "structs with
pointers" encoded back-to-back. The format of the tag word must be:
+--+-+--------------------------+---+---------------------------+
|Ty|M| |Sub| Length |
+--+-+--------------------------+---+---------------------------+
Ty ( 2 bits) = "compound pointer"
M ( 1 bit) = "last pointer in struct"
Sub ( 3 bits) = subtype of each struct; one of
- "struct with no pointers"
- "struct with pointers"
Length (29 bits) = length of each struct, in words
(Or we might want to use the offset field for something.)
Final observation that doesn't really fit anywhere: I don't see why the
suggested framing format for byte streams doesn't encode the segment
list as a capnp list of 32-byte scalars, i.e. an appropriate 64-bit
pointer immediately followed by the segment lengths. (The offset in
the pointer is thus fixed.) The minor 4-byte redundancy can be used as
a magic number and to distinguish between little- and big-endian capnp
in case the latter ever comes to pass.
[1] That's the entire reason for <https://github.com/tailhook/probor/>,
for example, even though implementing CBOR deserialization is in fact
more difficult than capnp deserialization.
Cheers,
Alex
(In case you think this discussion is inappropriate for this list, feel
free to reply to me privately.)
The capnp serialization format looks complicated and even ugly[1] when
you read the spec; at least that was certainly my impression the first
three or so times I read it. Then I set out to rethink it, and it
turns out that more or less every piece of it exist for some reason,
and some of the solutions are extremely neat. (For example, for 2^32-
byte long segments, a signed displacement needs to be 33 bits long---
and it is! Or rather it is 30 bits and measured in words. Not so in
FlatBuffers.)
Still, after thinking about it for some more time, I came up with a set
of simplifications of the current spec. I don't expect capnp or
anybody else will actually use them, but I think they might be
interesting anyway.
I'll present my logic first, rather than rewriting the spec, so that
it's clear where the decisions come from. I'm keeping most of the
general structure of the capnp encoding in place: storage is still
divided into 64-bit words, and pointers are still generally treated as
a pair 32-bit values ("address" and "length") plus some flags.
(1a) Even if a pointer stores a relative offset (a "displacement"), we
still need to know which segment we're in to follow it in order to
avoid going out of bounds. We might as well just store the address
relative to the beginning of the segment (an "offset"). This means 29
bits for a word-aligned address, freeing up one bit in the address
field in addition to the two occupied by the type.
(1b) Which bit is the spare and which two bits are the type? I will
choose bits 0--1 (LSB) as the type and 2 as the spare. Then the
decoder does not need a barrel shifter to read the fields (or any
shifts at all, in fact). This might come in handy when decoding on an
eight-bit micro, although that's a bit fanciful.
(2a) While the encoder _could_ choose the size of the data and pointer
sections of a struct separately, in reality it does not. (I'm
considering only the normal, zero-copy, arbitrary-initialization order
encoder, not the canonicalizing one.) If you know the schema, you can
compute how many pointers there are in a struct from its total
size. Thus only the total size in words needs to be stored; you can
then allow up to 2^32 bytes per struct and still have three bits free
in the length field [bits 0--2, for the same reason as in (1b)]. If
you can look the number of pointers up in a table, reading any field of
a known struct still takes constant time.
(2b) What if you _don't_ know the schema at all? Well, in this case
the only thing you could be doing is copying the struct, and this means
reading it in full, in linear time. Assuming you know where the
pointer section starts, it is sufficient set aside a bit in each
pointer to indicate if it is the last in the current struct; use the
spare address bit from (1a) for that. The only problem is if the
struct contains no pointers at all; encode that in the spare bits from
(2a). [Alternatively, a struct with no pointers could be encoded
without ambiguity as a list of bytes (4a), but I'm not sure that is a
good idea.]
(2c) There's a final, tricky case: what if you need to read an old
field out of a new struct using an old schema? (This is not explicitly
advertised in current capnp, but the format does make it
possible.) The struct is longer than possible with your current
schema, and the lookup table from (2a) is not sufficient. Well, OK; if
it's so long, you can certainly be sure it contains all the fields that
you do know about. (Barring unknown choices for unions, but these are
not the problem.) The problem is that you can't figure out where the
pointer section ends the way you did in (2b)---you need to do it in
constant time. You need to know, just from looking at the struct size,
where _both_ the data section and the pointer section start. The
solution is simple: encode one of them in reverse order, starting from
the end of the struct. I'll choose the pointer section here.
(3) What do we need to store in a far pointer? Well, we need at least
as much info as is contained in a regular pointer (one word---the
landing pad---including the object offset). We also need to store the
object segment, the landing pad offset and the landing pad
segment. The latter we can also fit in one word---the far pointer---if
(and only if) we restrict ourselves to 2^16 segments, storing both
segment numbers in the length field of the far pointer. This avoids
ugly double-far pointers.
(4a) If lists of structs can never be encoded as lists of scalars
unless each element encodes the first (scalar) field of the struct, the
element size of a list of scalars is always known from the schema, so
we only need to encode the length in bytes. And if we don't know the
schema, we could only be copying the list, so the length in bytes is
again sufficient. This means we can use all 32 bits of the length
field, allowing for 2^32 bytes in a list---the maximum possible given
the segment size. Lists of scalars then take up one pointer type (out
of four).
(4b) The only problem is lists of bits (well, also lists of Void, but I
don't see how these are useful if lists of structs cannot be encoded as
lists of scalars). If we restrict them to at most 2^32 elements as
well, we can encode them in a separate pointer type.
(4c) How do we encode lists of pointers? We have used up all four
available pointer types, but we use the type of structs with a
different value in the spare bits from (2a). The length field then
stores the length of the list in words, which is equal to the number of
elements. [Alternatively, a list of pointers could be encoded without
ambiguity as a struct, but I'm not sure that's a good idea.]
(4c) How do we encode lists of structs? First, the tag word as it
currently exists is unavoidable, because we need to store the size of
each struct somewhere, and that takes at least a halfword. Storing the
number of elements in the tag address is nice, but not necessary,
because the length in words is enough for bounds checking. Second, a
list of structs must be at most 2^29 words long (and a fortiori at most
2^29 elements long) to fit in a segment. Again, we can use the
available encoding space in the struct pointer type. It is also useful
to include the tag word in the lenghth in words, for easier copying.
(5) What about capability references? The only unused encoding space
is in the type of structs and lists of structs, where we have used
three out of eight possible combinations of the bits from (2a).
Allocating a single combination there makes for a capability table of
2^29 entries. I'm not sure I like carving pieces out of the encoding
space like this, though. [This is not entirely true: we've also not
defined the semantics of a struct list with an exotic tag word, and of
the "more flag" (1a) in a pointer list. Not sure what could be the use
of that.]
Let me now describe the format that results from these observations,
without fixing the numerical constants. The pointer formats are:
+--+-+--------------------------+-------------------------------+
|Ty|M| | |
+--+-+--------------------------+-------------------------------+
Ty (2 bits) = type; one of
- "compound pointer"
- "far pointer"
- "list of bits"
- "list of bytes"
M (1 bit ) = more bit; one of
- "last pointer"
- "more pointers"
+--+-+--------------------------+-------------------------------+
|Ty|M| Offset | Length |
+--+-+--------------------------+-------------------------------+
Ty ( 2 bits) = "list of bits"
M ( 1 bit ) = more bit
Offset (29 bits) = offset of list, in words
Length (32 bits) = number of bits in list
+--+-+--------------------------+-------------------------------+
|Ty|M| Offset | Length |
+--+-+--------------------------+-------------------------------+
Ty ( 2 bits) = "list of bytes"
M ( 1 bit ) = more bit
Offset (29 bits) = offset of list, in words
Length (32 bits) = number of bytes in list
+--+-+--------------------------+---+---------------------------+
|Ty|M| Offset |Sub| Length |
+--+-+--------------------------+---+---------------------------+
Ty ( 2 bits) = "compound pointer"
M ( 1 bit) = more bit
Offset (29 bits) = offset of object, in words
Sub ( 3 bits) = subtype; one of
- "struct with no pointers"
- "struct with pointers"
- "list of pointers"
- "list of structs"
Length (29 bits) = length of object, in words
For Sub = "struct with no pointers", the object is contains data from
the struct.
For Sub = "struct with pointers", the object contains data from the
struct at the beginning and pointers from the struct in reverse order
at the end. The pointer at the lowest address has M = "last pointer",
other pointers have M = "more pointers". There must be at least one
pointer.
For Sub = "list of pointers", the object contains pointers. Each
pointer has M = (FIXME?).
For Sub = "list of structs", the object consists of a single tag word
followed by equal-length "structs with no pointers" or "structs with
pointers" encoded back-to-back. The format of the tag word must be:
+--+-+--------------------------+---+---------------------------+
|Ty|M| |Sub| Length |
+--+-+--------------------------+---+---------------------------+
Ty ( 2 bits) = "compound pointer"
M ( 1 bit) = "last pointer in struct"
Sub ( 3 bits) = subtype of each struct; one of
- "struct with no pointers"
- "struct with pointers"
Length (29 bits) = length of each struct, in words
(Or we might want to use the offset field for something.)
Final observation that doesn't really fit anywhere: I don't see why the
suggested framing format for byte streams doesn't encode the segment
list as a capnp list of 32-byte scalars, i.e. an appropriate 64-bit
pointer immediately followed by the segment lengths. (The offset in
the pointer is thus fixed.) The minor 4-byte redundancy can be used as
a magic number and to distinguish between little- and big-endian capnp
in case the latter ever comes to pass.
[1] That's the entire reason for <https://github.com/tailhook/probor/>,
for example, even though implementing CBOR deserialization is in fact
more difficult than capnp deserialization.
Cheers,
Alex
--
You received this message because you are subscribed to the Google Groups "Cap'n Proto" group.
To unsubscribe from this group and stop receiving emails from it, send an email to capnproto+***@googlegroups.com.
Visit this group at https://groups.google.com/group/capnproto.
You received this message because you are subscribed to the Google Groups "Cap'n Proto" group.
To unsubscribe from this group and stop receiving emails from it, send an email to capnproto+***@googlegroups.com.
Visit this group at https://groups.google.com/group/capnproto.