[capnproto] Thinking of building a size profiler -- thoughts, ideas?

Discussion:

p***@t.undra.org

2018-05-14 15:46:29 UTC

Hi,
For the project I'm working on I need to distribute some zipped capnproto
data. I'd like the data itself to be fairly small, but in particular I'd
like the result of zipping it to be the absolute smallest I can make it.

I used to use protocol buffers and implemented a size profiler for those.
It basically traversed the entire structure while keeping track of the path
that led to each point and counted the size of data encountered against a
fixed-size suffix of the path. It was pretty simple but really useful in
identifying where the problem points were. Now I've switched to capnproto
and am considering doing the same for that, possibly as a stand-alone tool
if I have time. I'm assuming it won't be all that hard to do with the
reflection API. The plan is to use it on its own, but in particular to
combine it with a zip profiler I already have, to find the parts of the
data that don't compress well.
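
A minimal sketch of the path-suffix accounting described above, written
against plain Python dicts and lists rather than protocol buffers or
capnproto messages. The names profile_sizes and leaf_size are hypothetical,
and the leaf sizes are rough stand-ins, not real wire sizes.

```python
from collections import defaultdict


def leaf_size(value):
    """Rough, assumed size in bytes of a leaf value (not a real wire size)."""
    if value is None:
        return 0
    if isinstance(value, bool):  # check bool before int: bool is an int subclass
        return 1
    if isinstance(value, (int, float)):
        return 8
    if isinstance(value, str):
        return len(value.encode("utf-8"))
    if isinstance(value, (bytes, bytearray)):
        return len(value)
    raise TypeError(f"unsupported leaf type: {type(value).__name__}")


def profile_sizes(value, suffix_len=2):
    """Walk a nested dict/list and total leaf sizes per path suffix."""
    if suffix_len < 1:
        raise ValueError("suffix_len must be at least 1")
    totals = defaultdict(int)

    def visit(node, path):
        if isinstance(node, dict):
            for key, child in node.items():
                visit(child, path + (key,))
        elif isinstance(node, (list, tuple)):
            for child in node:
                # All elements of a list share one path component.
                visit(child, path + ("[]",))
        else:
            # Charge the leaf to the last `suffix_len` components of its path.
            totals[path[-suffix_len:]] += leaf_size(node)

    visit(value, ())
    return dict(totals)
```

Keying on a fixed-size suffix rather than the full path merges values that
reach the same field through different parents, which keeps the report
short; a larger suffix_len gives a finer breakdown.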

My question is: has anyone already built something like this, or thought
about it enough to have input on how such a tool should work? Also, I
wonder whether this would be useful to anyone else.

c

--
You received this message because you are subscribed to the Google Groups "Cap'n Proto" group.
To unsubscribe from this group and stop receiving emails from it, send an email to capnproto+***@googlegroups.com.
Visit this group at https://groups.google.com/group/capnproto.

p***@t.undra.org

2018-05-22 10:26:02 UTC

I've built a proof of concept and it seems to work okay; it's
here: https://github.com/tundra/capnprof.

The one issue I've run into is that in order to figure out how much space
(approximately) a value took up in the zipped data I need to know the
location in memory of all values, so I can map them back to the input. I
haven't been able to figure out a way to get that for the pointer sections
in structs, except by just reading past where I know the data section ends.
That seems a little hacky though -- is there a non-hacky way to do this?

I've attached two small examples of the output. They're profiles of the
same data, but one is ordered by zipped size and the other by unzipped size
(called weight and bytes respectively). It gives a sense of the kind of
information you can derive. This is transit schedule information, and I
found it interesting that place names dominate the unzipped data by a fair
margin, whereas in the zipped version it's geographic locations (which rank
6th in the unzipped data) that dominate, and place names compress well
enough to sink to rank 3.
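
The attribution step described above (map each value's byte range back to
the input, then total unzipped bytes and zipped weight per path) could be
sketched roughly as follows. This is not the capnprof code. It assumes a zip
profiler has already produced byte_costs, an estimated compressed cost for
every input byte, and that each value is known as a (path, offset, length)
range; both of those inputs and all function names are hypothetical.

```python
from collections import defaultdict


def attribute_ranges(ranges, byte_costs):
    """Total unzipped bytes and estimated zipped weight per path.

    ranges:     iterable of (path, offset, length) for each value
    byte_costs: assumed per-byte compressed cost of the input, e.g. in bits
    """
    bytes_by_path = defaultdict(int)
    weight_by_path = defaultdict(float)
    for path, offset, length in ranges:
        if offset < 0 or offset + length > len(byte_costs):
            raise ValueError(f"range for {path!r} falls outside the input")
        bytes_by_path[path] += length
        weight_by_path[path] += sum(byte_costs[offset:offset + length])
    return dict(bytes_by_path), dict(weight_by_path)


def ranked(totals):
    """Sort (path, total) pairs from largest to smallest total."""
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

Ranking the two dictionaries separately gives the two orderings, one by
bytes and one by weight, so a field that is large but repetitive can rank
high in the first and low in the second.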

c

Hi,
This sounds neat. I'm not aware of anyone having built such a tool yet.
It should indeed be straightforward using the Dynamic API, or maybe the
"Any" API (AnyPointer/AnyList/AnyStruct), which gives you a lower-level
view of the object tree.
-Kenton

'Kenton Varda' via Cap'n Proto

2018-05-30 20:57:56 UTC

Post by p***@t.undra.org
https://github.com/tundra/capnprof.

Neat! Let me know if / when you think this should be linked from
capnproto.org. (Probably should have a readme first. :) )

Post by p***@t.undra.org
The one issue I've run into is that in order to figure out how much space
(approximately) a value took up in the zipped data I need to know the
location in memory of all values, so I can map them back to the input. I
haven't been able to figure out a way to get that for the pointer sections
in structs, except by just reading past where I know the data section ends.
That seems a little hacky though -- is there a non-hacky way to do this?

I don't think there is. But it seems pretty safe to assume that if the
pointer section has non-zero size, then it starts immediately after the
data section and that each pointer is 8 bytes.

-Kenton
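
A rough sketch of that assumption in code: decode a struct pointer word as
laid out in the Cap'n Proto encoding spec (2-bit kind, signed 30-bit offset
in words, 16-bit data section size, 16-bit pointer section size), then place
the pointer section directly after the data section with 8 bytes per
pointer. The function names are hypothetical, the code works on raw segment
bytes rather than through the capnproto library, and it handles only a
single segment (no far pointers).

```python
import struct

WORD = 8  # Cap'n Proto words, and therefore pointers, are 8 bytes


def struct_sections(segment, pointer_offset):
    """Return ((data_start, data_len), (ptr_start, ptr_len)) in bytes.

    `pointer_offset` is the byte offset of a struct pointer within `segment`.
    A null pointer (all zero bits) decodes as two empty sections.
    """
    (word,) = struct.unpack_from("<Q", segment, pointer_offset)
    if word & 0x3 != 0:
        raise ValueError("not a struct pointer")
    # Bits 2..31: signed offset in words from the end of the pointer
    # to the start of the struct's data section.
    offset_words = (word >> 2) & 0x3FFFFFFF
    if offset_words & (1 << 29):
        offset_words -= 1 << 30
    data_words = (word >> 32) & 0xFFFF  # data section size, in words
    ptr_words = (word >> 48) & 0xFFFF   # pointer section size, in words

    data_start = pointer_offset + WORD + offset_words * WORD
    # Assumption from the thread: pointers start right after the data section.
    ptr_start = data_start + data_words * WORD
    return (data_start, data_words * WORD), (ptr_start, ptr_words * WORD)


def pointer_slots(ptr_start, ptr_len):
    """List the (offset, length) byte range of each 8-byte pointer slot."""
    return [(off, WORD) for off in range(ptr_start, ptr_start + ptr_len, WORD)]
```

Each slot returned by pointer_slots is a byte range that can be charged to
the struct that owns it, the same way the data section is.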