[capnproto] About random access for Cap'N proto message

Discussion:

w***@gmail.com

2017-12-05 18:38:31 UTC

Hi,

I am working on a project that uses protobuf to encode and decode
messages, and I am evaluating whether it is worth migrating to Cap'n Proto.
I am using the Java implementation of Cap'n Proto:
https://github.com/capnproto/capnproto-java

The documentation at https://capnproto.org/index.html mentions random access
as a key feature, but I cannot find any code example that demonstrates it.
Am I misunderstanding it? Does "random access" simply mean that we can
access any field without "deserializing" the whole message (which is not
actually serialized at all if it is not packed)?

What I thought "random access" meant is that Cap'n Proto can read any field
back from disk without loading all of the message data into memory.
But from the Java API implementation (the source code), it seems that it
always reads the whole message into a byte buffer, calls getRoot, and then
accesses fields. So I guess my understanding is wrong, isn't it?

Our scenario:
Our current protobuf message schema has many fields (~100), with other
messages embedded. The serialized message size varies from hundreds of bytes
to tens of kilobytes, and a few large messages may exceed 1 megabyte. We
store the messages as compressed byte arrays in an underlying KV store; to
read one, we fetch it from the KV store, decompress it, and then parse it
into a protobuf object.

In this case, do you think it is worth migrating from protobuf to Cap'n
Proto? If so, how can I benefit from the "random access" feature?

Thanks,
Tao

--
You received this message because you are subscribed to the Google Groups "Cap'n Proto" group.
To unsubscribe from this group and stop receiving emails from it, send an email to capnproto+***@googlegroups.com.
Visit this group at https://groups.google.com/group/capnproto.

'Kenton Varda' via Cap'n Proto

2017-12-05 19:08:31 UTC

Hi Tao,

You can get random access to files on disk by memory mapping the file. In
Java, you would use FileChannel.map() to get a MappedByteBuffer. You can
then pass that ByteBuffer off to Cap'n Proto and use it like any other
ByteBuffer. The operating system will not actually read in the data from
disk until your program attempts to access the corresponding part of the
MappedByteBuffer, which Cap'n Proto will only do when you invoke the
accessor for a field located there. So, somewhat magically, you get random
access.

Unfortunately, you cannot get random access to compressed data this way,
unless the compression is implemented inside the OS / filesystem. (And most
compression methods are not random-access-friendly anyhow.)

-Kenton
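
A minimal sketch of the memory-mapping approach described above, using only
the JDK. It maps a file with FileChannel.map() and reads a single 8-byte word
at an offset, so the OS only needs to page in the part of the file that is
actually touched. The class and method names (MappedRead, mapReadOnly,
readWord) and the file layout are made up for illustration. With
capnproto-java the mapped buffer would instead be handed to the library's
message reader, which is not shown here.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MappedRead {
    /** Maps the whole file read-only; the OS loads pages lazily on access. */
    public static MappedByteBuffer mapReadOnly(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            // The mapping remains valid after the channel is closed.
            return ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
        }
    }

    /** Reads one little-endian 64-bit word at the given word index. */
    public static long readWord(ByteBuffer buf, int wordIndex) {
        return buf.order(ByteOrder.LITTLE_ENDIAN).getLong(wordIndex * 8);
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical test file: 1024 words, word i holds the value i * 10.
        Path file = Files.createTempFile("mapped-demo", ".bin");
        file.toFile().deleteOnExit();
        ByteBuffer data = ByteBuffer.allocate(1024 * 8).order(ByteOrder.LITTLE_ENDIAN);
        for (int i = 0; i < 1024; i++) {
            data.putLong(i * 8, i * 10L);
        }
        Files.write(file, data.array());

        // Only the page containing word 700 has to be read from disk here.
        MappedByteBuffer mapped = mapReadOnly(file);
        System.out.println(readWord(mapped, 700));
    }
}
```

Note that this only helps when the bytes on disk are the unpacked,
uncompressed message, which is the limitation raised in the reply above.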

w***@gmail.com

2017-12-05 22:27:29 UTC

Thanks a lot. I got it. In my case, I will always read the compressed byte
array back from the KV store, decompress it, and then read fields. So, in
this case, "random access" means Cap'n Proto will only create the object for
the field being read from the unpacked message, without creating temporary
objects for other fields; in other words, all other fields remain flat bytes
with no managed objects created. Is that correct?

Another question: how do I write a message in packed format to a byte
array? I have to allocate a ByteBuffer with enough capacity to store the
message, but it is not possible to know the packed message size without
packing it first. Currently, I allocate its unpacked size
(computeSerializedSizeInWords * 8), then use a tricky way to trim the
trailing zeros. Do you know if there is a better way to do this?

Thanks,
Tao

'Kenton Varda' via Cap'n Proto

2017-12-06 23:05:06 UTC

Post by w***@gmail.com
Thanks a lot. I got it. In my case, I will always read the compressed byte
array back from the KV store, decompress it, and then read fields. So, in
this case, "random access" means Cap'n Proto will only create the object for
the field being read from the unpacked message, without creating temporary
objects for other fields; in other words, all other fields remain flat bytes
with no managed objects created. Is that correct?

Yes. However, if you're reading *packed* messages, then packed bytes do
need to be unpacked upfront. They are unpacked into another ByteBuffer. No
message objects are created, but this does require reading through all the
bytes.

The memory mapping strategy I described does not work for packed messages.
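
To illustrate why packed input has to be read from start to finish, here is
a rough sketch of an unpacker for Cap'n Proto's packed encoding, written
against the JDK only. Each tag byte says which bytes of the next 8-byte word
are non-zero; a 0x00 tag is followed by a count of additional zero words, and
a 0xff tag is followed by a count of words stored verbatim. The class and
method names (UnpackSketch, unpack) are hypothetical, input validation is
omitted, and this is not the capnproto-java implementation.

```java
import java.io.ByteArrayOutputStream;

public class UnpackSketch {
    /** Expands packed bytes into whole 8-byte words in a new array. */
    public static byte[] unpack(byte[] packed) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < packed.length) {
            int tag = packed[i++] & 0xff;
            // One output word: bit b of the tag says whether byte b is stored.
            for (int b = 0; b < 8; b++) {
                out.write((tag & (1 << b)) != 0 ? packed[i++] : 0);
            }
            if (tag == 0x00) {
                // The next byte counts additional all-zero words.
                int zeroBytes = (packed[i++] & 0xff) * 8;
                for (int z = 0; z < zeroBytes; z++) {
                    out.write(0);
                }
            } else if (tag == 0xff) {
                // The next byte counts words that follow verbatim.
                int rawBytes = (packed[i++] & 0xff) * 8;
                out.write(packed, i, rawBytes);
                i += rawBytes;
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] packed = {0x51, 0x08, 0x03, 0x02, 0x31, 0x19, (byte) 0xaa, 0x01};
        System.out.println(unpack(packed).length + " bytes after unpacking");
    }
}
```

Because the position of every word in the output depends on all the tag
bytes before it, there is no way to jump to a field without scanning the
packed bytes that precede it.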

Post by w***@gmail.com
Another question: how do I write a message in packed format to a byte
array? I have to allocate a ByteBuffer with enough capacity to store the
message, but it is not possible to know the packed message size without
packing it first. Currently, I allocate its unpacked size
(computeSerializedSizeInWords * 8), then use a tricky way to trim the
trailing zeros. Do you know if there is a better way to do this?

The only way to know the packed size is to actually run the packing
algorithm. You could run the algorithm twice, once where you throw away the
data just to get the size, and then another time to save it. Or, you could
allocate successive buffers on-demand, and then assemble them into one big
buffer at the end. Or, if you're going to write to an OutputStream anyway,
write the bytes to the OutputStream as they are being packed, rather than
packing everything first and writing second.

-Kenton
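
The options in the last reply can be sketched with plain JDK streams. The
packer below is a simplified stand-in for the real packing code (it never
emits verbatim runs after a 0xff tag, which is still valid packed output),
and all names (PackSketch, CountingStream, pack, packedSize, packToArray) are
hypothetical. Because pack() writes to any OutputStream, the same routine
covers all three suggestions: a counting pass that discards the data, a
growable in-memory buffer, or streaming straight to the destination.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public class PackSketch {
    /** Discards everything written to it and only counts the bytes. */
    public static final class CountingStream extends OutputStream {
        public long count = 0;
        @Override public void write(int b) { count++; }
    }

    /** Packs whole 8-byte words, writing output as it is produced. */
    public static void pack(byte[] words, OutputStream out) throws IOException {
        if (words.length % 8 != 0) {
            throw new IllegalArgumentException("input must be whole 8-byte words");
        }
        int i = 0;
        while (i < words.length) {
            int tag = 0;
            for (int b = 0; b < 8; b++) {
                if (words[i + b] != 0) tag |= 1 << b;
            }
            out.write(tag);
            for (int b = 0; b < 8; b++) {
                if (words[i + b] != 0) out.write(words[i + b]);
            }
            i += 8;
            if (tag == 0x00) {
                // Fold up to 255 further all-zero words into one count byte.
                int run = 0;
                while (run < 255 && i < words.length && isZeroWord(words, i)) {
                    run++;
                    i += 8;
                }
                out.write(run);
            } else if (tag == 0xff) {
                // Count of verbatim words that follow; this sketch emits none.
                out.write(0);
            }
        }
    }

    static boolean isZeroWord(byte[] words, int offset) {
        for (int b = 0; b < 8; b++) {
            if (words[offset + b] != 0) return false;
        }
        return true;
    }

    /** Option 1: run the algorithm once just to learn the packed size. */
    public static long packedSize(byte[] words) throws IOException {
        CountingStream counter = new CountingStream();
        pack(words, counter);
        return counter.count;
    }

    /** Option 2: let a growable buffer assemble the result. */
    public static byte[] packToArray(byte[] words) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        pack(words, out);
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] words = new byte[64];
        words[0] = 8;
        System.out.println(packedSize(words) + " packed bytes from " + words.length);
    }
}
```

ByteArrayOutputStream grows on demand and toByteArray() returns an array of
exactly the packed length, so there is no need to over-allocate and trim
trailing zeros. When the bytes are headed to a file or socket anyway, passing
that stream to pack() directly avoids the intermediate array.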