
Thread: Riegeli — a new compressor for structured data (protocol buffers)

  1. #1
    Member
    Join Date
    Jun 2015
    Location
    Switzerland
    Posts
    667
    Thanks
    204
    Thanked 241 Times in 146 Posts

    Riegeli — a new compressor for structured data (protocol buffers)

    Riegeli is the latest creation in the gipfeli, zopfli, brotli, butteraugli, and guetzli series. It is a new fast, robust and feature-rich way of compressing protocol buffers. It supports dense compression, fast decoding, seeking, great data integrity, fast filtering, and parallel encoding.

    Protocol buffers are a language-neutral, platform-neutral extensible mechanism for serializing structured data -- think XML, but faster and smaller.

    Riegeli can be useful for big data, storing logs, storing heterogeneous application data, mobile, wasm use, game development, possibly as a replacement for JSON, etc.

  2. The Following 7 Users Say Thank You to Jyrki Alakuijala For This Useful Post:

    Bulat Ziganshin (11th January 2018),Cyan (10th January 2018),encode (11th January 2018),khavish (13th January 2018),load (11th January 2018),Mike (14th January 2018),schnaader (14th January 2018)

  3. #2
    Member
    Join Date
    Nov 2015
    Location
    boot ROM
    Posts
    83
    Thanks
    25
    Thanked 15 Times in 13 Posts
    Quote Originally Posted by Jyrki Alakuijala View Post
    Protocol buffers are a language-neutral, platform-neutral extensible mechanism for serializing structured data -- think XML, but faster and smaller.
    There is one big difference between protocol buffers and XML. To parse protocol buffers reasonably one MUST have a priori knowledge of "schema". Protocol buffers do not address this on their own. XML theoretically lacks this requirement. While most programs would fail to parse "arbitrary" XML reasonably, it is still possible to get an idea of the document structure and so on. Protocol buffers do not allow this on their own, as far as I understand. So while I guess everyone would agree that XML is horrible to edit, bloated, and slow to parse, and that not knowing the size of incoming tags in advance sucks, especially over a network, I have to admit one can't really think of protocol buffers as XML; they are different. One could parse "unknown" XML at least somehow. That is not going to work with protocol buffers. Other than that, protocol buffers look good. But their "extensible" comes with a big catch.
    Last edited by xcrh; 12th January 2018 at 07:10.

  4. #3
    Programmer schnaader's Avatar
    Join Date
    May 2008
    Location
    Hessen, Germany
    Posts
    539
    Thanks
    192
    Thanked 174 Times in 81 Posts
    Looks interesting. I came in contact with proto files two times so far: OpenStreetMap maps (PBF format) and machine learning (Tensorflow). Both times I found it interesting trying to compress those files further. Since for most compressors they are just blobs filled with binary data and strings and usually fields aren't fixed size, there's not much gain.

    But knowing the protocol buffers format, it would be possible to parse and restructure the data (sorting it by type, e.g. strings/numbers, or by tags), even without knowledge of the matching .proto definition. It seems that this is something that can be done with Riegeli and I would be interested in integrating this in Precomp. So my questions would be:

    1. Is further compressing protobuf data one of the use cases of Riegeli? The (not yet documented) "Transposed chunk" section in the documentation looks like the restructuring I mentioned above, so it might fit my use case. What I'd try is using riegeli on protobuf data without any compression, compress the result using Precomp (basically LZMA2) and restoring the original protobuf data by extracting and running riegeli on it in the other direction.
    2. How to build it? Some quick research showed the build tool used seems to be Bazel, but I've not tested it yet, so confirmation would be nice. Also, you might think about adding this to the README/documentation.
       Edit: Got the Bazel Windows binary now; running "bazel build" in the riegeli directory finds Visual Studio and creates some strange "folders", but doesn't seem to build anything (it also says "0 targets").
       Edit2: "bazel build //riegeli/base:base" builds the base part and creates "libbase.a", so I'm getting closer.
    3. What were the reasons to use Brotli and zstd for compression? I would have guessed fast decompression and easy integration, but as I'll most likely use LZMA2, I'd be interested if this has downsides.
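The restructuring idea above (grouping values by tag before compressing) can be sketched roughly as follows. This is only an illustration and not Riegeli's actual transposed chunk format, which is not documented at the time of this thread. The `records` data and the `transpose`/`restore` helpers are made up for the example, and records are assumed to be already split into (field number, raw value) pairs.

```python
from collections import defaultdict


def transpose(records):
    """Group values by field number; keep each record's tag order so it can be undone."""
    tags = []
    columns = defaultdict(list)
    for record in records:
        tags.append([field for field, _ in record])
        for field, value in record:
            columns[field].append(value)
    return tags, dict(columns)


def restore(tags, columns):
    """Rebuild the original records from the tag sequences and the per-field columns."""
    cursors = {field: iter(values) for field, values in columns.items()}
    return [[(field, next(cursors[field])) for field in record_tags]
            for record_tags in tags]


# Hypothetical records: field 1 = name, field 2 = age (one byte), field 3 = e-mail.
records = [
    [(1, b"John"), (2, b"\x14")],
    [(1, b"Jane"), (2, b"\x15"), (3, b"jane@example.com")],
    [(1, b"Joe"), (2, b"\x16")],
]

tags, columns = transpose(records)
print(columns[1])  # [b'John', b'Jane', b'Joe']
assert restore(tags, columns) == records
```

After transposition, similar values sit next to each other, so each column (and the tag sequence) could be handed to a general-purpose compressor separately.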
    Last edited by schnaader; 14th January 2018 at 13:11.
    http://schnaader.info
    Damn kids. They're all alike.

  5. The Following User Says Thank You to schnaader For This Useful Post:

    Bulat Ziganshin (14th January 2018)

  6. #4
    Member
    Join Date
    Jun 2015
    Location
    Switzerland
    Posts
    667
    Thanks
    204
    Thanked 241 Times in 146 Posts
    Quote Originally Posted by xcrh View Post
    To parse protocol buffers reasonably one MUST have a priori knowledge of "schema".
    Could you explain more about this? What kind of parsing is possible on xml and not on protocol buffers?

    Protocol buffers carry a hierarchy of attribute names. The names are small integers. Also, each data type is encoded in the stream. It is possible, for example, to transpose or filter protocol buffers without a schema.
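A minimal sketch of such schema-free parsing, assuming the standard protobuf wire format (each field starts with a varint key holding the field number and a 3-bit wire type). The function names are illustrative and are not part of Riegeli or the protobuf library; the deprecated group wire types (3 and 4) are not handled.

```python
def read_varint(buf, pos):
    """Decode a base-128 varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def parse_fields(buf):
    """Split a serialized message into (field_number, wire_type, value) triples."""
    fields = []
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        field_number, wire_type = key >> 3, key & 0x7
        if wire_type == 0:      # varint
            value, pos = read_varint(buf, pos)
        elif wire_type == 1:    # fixed 64-bit
            value, pos = buf[pos:pos + 8], pos + 8
        elif wire_type == 2:    # length-delimited: string, bytes, or nested message
            length, pos = read_varint(buf, pos)
            value, pos = buf[pos:pos + length], pos + length
        elif wire_type == 5:    # fixed 32-bit
            value, pos = buf[pos:pos + 4], pos + 4
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        fields.append((field_number, wire_type, value))
    return fields


# Hand-assembled message: field 1 = "John" (length-delimited), field 2 = 20 (varint).
message = bytes([0x0A, 0x04]) + b"John" + bytes([0x10, 0x14])
fields = parse_fields(message)
print(fields)  # [(1, 2, b'John'), (2, 0, 20)]

# Filtering without a schema: keep only field number 2.
print([f for f in fields if f[0] == 2])  # [(2, 0, 20)]
```

Without the schema, a length-delimited value cannot be told apart as a string, raw bytes, or a nested message, so this sketch leaves it as bytes.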

  7. #5
    Member
    Join Date
    Nov 2015
    Location
    boot ROM
    Posts
    83
    Thanks
    25
    Thanked 15 Times in 13 Posts
    Yes, I can. In XML you could basically write it like this:
    Code:
    <person>
    <name>John</name>
    <age>20</age>
    </person>
    So we know it's something "person" with nested properties "name" = "John" and "age" = 20. As far as I can tell, a comparable protobuf declaration would give me just "John" and 20 on the wire (plus a small extra to delimit the two fields). Things like "name" or "age" are lost, in the sense that they are not transmitted in the wire format. So the decoder must have a priori knowledge that it is looking for exactly this kind of thing, not something else. Unless one knows they are looking for a name and an age, there is no way to meaningfully decode the John and 20 received on the decoder input. The required knowledge is not transferred on the wire; it has to be obtained somewhere else, unlike with XML (which could also refer to a full-blown DTD, etc.). The most ironic part? The forum's parser readily downgraded straightforward XML to the same "john 20", pretty much in the protobuf spirit, so it took some effort to persuade the parser to back off. I would still admit protocol buffers are an interesting thing, but they are not an equivalent of XML. Probably one can get some agreement on top of protobuf, e.g. one can mandate that there MUST be both a "key" and its "value" and that it is a "decode error" otherwise, so that names and values are both sent over the wire as key-value pairs. But that is not part of the standard encoding rules. One can't parse arbitrary protobuf input meaningfully.
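To make the "what ends up on the wire" point concrete, here is a small sketch that hand-encodes the person example. It assumes a hypothetical schema in which "name" is string field 1 and "age" is integer field 2; those field numbers and the helper names are made up for illustration.

```python
def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        low = value & 0x7F
        value >>= 7
        if value:
            out.append(low | 0x80)  # more bytes follow
        else:
            out.append(low)
            return bytes(out)


def encode_key(field_number, wire_type):
    """A field key is the field number shifted left by 3, OR-ed with the wire type."""
    return encode_varint((field_number << 3) | wire_type)


def encode_person(name, age):
    # Hypothetical schema: string name = 1; int32 age = 2 (non-negative ages only).
    name_bytes = name.encode("utf-8")
    return (encode_key(1, 2) + encode_varint(len(name_bytes)) + name_bytes
            + encode_key(2, 0) + encode_varint(age))


wire = encode_person("John", 20)
print(wire.hex(" "))  # 0a 04 4a 6f 68 6e 10 14
print(len(wire))      # 8
```

The strings "name" and "age" never appear in the output. Only the field numbers 1 and 2 and the wire types travel with the values.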

  8. #6
    Member
    Join Date
    Jun 2015
    Location
    Switzerland
    Posts
    667
    Thanks
    204
    Thanked 241 Times in 146 Posts
    Quote Originally Posted by xcrh View Post
    So we know it's something "person" with nested properties ...
    You are right, you definitely lose the human readability of the attribute names. You can still parse the protocol buffers; you just don't have human-readable attribute names.

  9. The Following User Says Thank You to Jyrki Alakuijala For This Useful Post:

    Alexander Rhatushnyak (18th January 2018)

  10. #7
    Member
    Join Date
    Jun 2015
    Location
    Switzerland
    Posts
    667
    Thanks
    204
    Thanked 241 Times in 146 Posts
    Quote Originally Posted by schnaader View Post
    Looks interesting. I came in contact with proto files two times so far: OpenStreetMap maps (PBF format) and machine learning (Tensorflow). Both times I found it interesting trying to compress those files further. Since for most compressors they are just blobs filled with binary data and strings and usually fields aren't fixed size, there's not much gain.

    But knowing the protocol buffers format, it would be possible to parse and restructure the data (sorting it by type, e.g. strings/numbers, or by tags), even without knowledge of the matching .proto definition. It seems that this is something that can be done with Riegeli and I would be interested in integrating this in Precomp. So my questions would be:

    1. Is further compressing protobuf data one of the use cases of Riegeli? The (not yet documented) "Transposed chunk" section in the documentation looks like the restructuring I mentioned above, so it might fit my use case. What I'd try is using riegeli on protobuf data without any compression, compress the result using Precomp (basically LZMA2) and restoring the original protobuf data by extracting and running riegeli on it in the other direction.
    2. How to build it? Some quick research showed the build tool used seems to be Bazel, but I've not tested it yet, so confirmation would be nice. Also, you might think about adding this to the README/documentation.
       Edit: Got the Bazel Windows binary now; running "bazel build" in the riegeli directory finds Visual Studio and creates some strange "folders", but doesn't seem to build anything (it also says "0 targets").
       Edit2: "bazel build //riegeli/base:base" builds the base part and creates "libbase.a", so I'm getting closer.
    3. What were the reasons to use Brotli and zstd for compression? I would have guessed fast decompression and easy integration, but as I'll most likely use LZMA2, I'd be interested if this has downsides.
    1. Riegeli is intended for fast compression and decompression (fast meaning ~100 MB/s compression, 300-1000 MB/s decompression). Because of this it may or may not fit the LZMA2 use case (where you are likely to look into slower and more dense possibilities).

    2. You are the first person in the world to compile it for Windows. Congratulations if you are successful!

    3. Zstd and Brotli give two speed/density compromises: Brotli gives a tiny bit more density at some cost in decompression speed. LZMA2 probably fits Riegeli fine, but is unlikely to give a lot more savings over Brotli. Perhaps you get 1-2% more and slow down decoding by a factor of 4. Still, it can be a good compromise; it all depends on what is valuable to your users. (We didn't try LZMA2, so there could be more savings.)

  11. The Following User Says Thank You to Jyrki Alakuijala For This Useful Post:

    schnaader (18th January 2018)

  12. #8
    Member
    Join Date
    Aug 2017
    Location
    Mauritius
    Posts
    59
    Thanks
    67
    Thanked 22 Times in 16 Posts
    Quote Originally Posted by Jyrki Alakuijala View Post
    Riegeli is the latest creation in the gipfeli, zopfli, brotli, butteraugli, and guetzli series. It is a new fast, robust and feature-rich way of compressing protocol buffers. It supports dense compression, fast decoding, seeking, great data integrity, fast filtering, and parallel encoding.

    Protocol buffers are a language-neutral, platform-neutral extensible mechanism for serializing structured data -- think XML, but faster and smaller.

    Riegeli can be useful for big data, storing logs, storing heterogeneous application data, mobile, wasm use, game development, possibly as a replacement for JSON, etc.
    Looking at the trend, it seems that an SVG successor is next (protocol buffers + Riegeli).

  13. The Following User Says Thank You to khavish For This Useful Post:

    pothos2 (18th January 2018)
