Given that a process (tree) can nowadays be suspended, checkpointed to disk, and later resumed (using criu or OpenVZ on Linux, for instance) without rewriting the software, I'm wondering how this could be leveraged together with a (de-duplicating) compressor to more efficiently compress an ongoing stream of related or associated input. For example, suppose one generates similar documents, source code, and object code that are likely to compress well when considered as a whole (especially over the course of months), but not as well when compressed individually or incrementally. I'm wondering about the following scenario:
- Data is provided as input to a long-running compression process with bounded overall memory usage (for example, the compressor might by design be capped at 4 GB of RAM, barring a memory leak). Input may be supplied in pieces by writing a temporary file for each subsequent run and letting the compressor expect a large list of files to compress. The list would be "consumed" in pieces (say, xaa, xab, ...) as described below. These pieces might be incremental tarfiles with de-duplication post-processing.
- For each input data unit (starting with xaa), the output the compressor generates for that run (call it xaa.out, xab.out, and so on, or an artificial split of a single append-only output file) is what's written offsite for long-term archiving. The compression of a data unit "absorbs" the input and "emits" the corresponding compressed output. That output is necessarily coupled to all previously written data, so one can't just pluck out a subset without restoring everything written so far. That has a benefit and a downside, but disaster (and thus the need for recovery) is deemed rare enough that the extra overhead is acceptable. For this approach to work, it is vital that the compressor flush enough of its output that the input data unit is guaranteed to be recoverable (the alternative is to let the output "lag" until the compressor's next natural write point, or to include the checkpointed state as part of what's written, which is enough to enable later emission of all prior input).
- The process tree is suspended and the compressor's state is written out. Only the latest version of the state needs to be kept, and only to enable further blocks to be written if the local copy of the state is lost (or, as mentioned above, it could be kept around to ensure that input which has not been completely synced to output could still eventually be written out when combined with the in-memory state). Once compression has been running for some time, the total stored output should eventually dwarf the amount of memory that needs to be checkpointed and included at each step.
- When a new data unit is available (call it xab; it might be, say, a new daily or weekly backup), the state is fetched from the remote if needed (or retrieved locally) and the process tree is resumed. The compressor continues just as before, consuming input from this file (xab) and emitting its incremental output when done, in an iterated fashion.
- This continues whenever there is new data to process. A verification data set could also be stored to reconfirm that incremental decompression succeeds and that the data therefore remains provably recoverable in its entirety. This might require keeping a set of checkpointed decoder states to ensure full fidelity.
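To make the "flush enough output per unit, carry the state forward" idea concrete, here is a minimal in-process sketch using Python's zlib as a stand-in for the stateful compressor (the compressor object's persistent history window plays the role of the state that criu would checkpoint between runs; the unit contents are made up for illustration):

```python
import zlib

# One long-lived compressor whose internal state persists across input units.
comp = zlib.compressobj(9)

# Three hypothetical input units (xaa, xab, xac) sharing most of their bytes.
units = [b"unit %d header\n" % i + b"mostly shared payload\n" * 300
         for i in range(3)]

outputs = []
for u in units:
    # Z_SYNC_FLUSH forces out everything needed to recover this unit while
    # keeping the compression state intact for the next one -- the "write
    # enough output so the input unit is recoverable" requirement above.
    outputs.append(comp.compress(u) + comp.flush(zlib.Z_SYNC_FLUSH))

# Recovery replays every prior block in order, matching the assumption that
# restoring unit N means decompressing the whole stream up to N.
decomp = zlib.decompressobj()
recovered = b"".join(decomp.decompress(o) for o in outputs)

# Later units compress far better than the first, because the carried-over
# state has already absorbed the shared payload.
print([len(o) for o in outputs])
```

The per-unit output sizes drop sharply after the first unit, which is exactly the whole-versus-parts gain the scenario is after; the price, visible in the recovery loop, is that no unit's output is decodable in isolation.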
I understand that the usual route is to use de-duplication to find fixed or variable-size identical blocks and then to compress the de-duplicated data individually (optionally using a persistent index). But certain kinds of compressors might adapt well to the data (keeping in-memory dictionaries, refining probability estimates over time, training networks) in such a way that incremental, rolling use over an entire data set is more efficient than compressing it in parts. Again, the assumption is that disaster and data recovery are rare, so it's considered acceptable to have to decompress the entire stream up to the point of interest.

A traditional de-duplication step could still be used as a preprocessor, either de-duplicating each unit individually or incrementally against a persistent local index. That step would remain important, since the compressor's bounded state can only capture a subset of the expected input properties as the input grows over time. But at the same time I'm wondering which kinds of compressors could best be coupled with this approach.

The checkpointed state could also be forked to adapt to diverging but still somewhat related input sets. This allows for an automatic preseed (in spirit), just as a subset of compressors now allow, and it might automatically incorporate some of the efficiencies of incremental updates, much as bsdiff tries to efficiently encode differences between files. Certain kinds of compressors would work well with this, and certain others would not. Feel free to share thoughts and feedback, as I'm curious how an OS-level technology might be used to more efficiently compress long-term total archives of related input data in a way that might outperform de-duplication plus compression in pieces.
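The forking idea can also be sketched in-process: zlib's `compressobj.copy()` clones the full internal state, which is a rough (single-process) analogue of forking a checkpointed compressor image so that two diverging input sets each inherit the shared history. The field/value payload below is invented for illustration:

```python
import zlib

# Some shared history both branches will have already "seen".
shared_history = b"".join(b"field%03d=value%03d;" % (i, i) for i in range(300))

comp = zlib.compressobj(9)
# In the real scenario this output would already be archived; here we only
# care about the state it leaves behind.
archived = comp.compress(shared_history) + comp.flush(zlib.Z_SYNC_FLUSH)

branch_a = comp.copy()  # each branch inherits the adapted state
branch_b = comp.copy()

out_a = branch_a.compress(b"branch A: " + shared_history) + branch_a.flush()
out_b = branch_b.compress(b"branch B: " + shared_history) + branch_b.flush()

# A fresh compressor with no inherited state pays full price for the
# repeated material -- no automatic preseed.
cold = len(zlib.compress(b"branch A: " + shared_history, 9))
print(len(out_a), len(out_b), cold)
```

Both forked branches encode the repeated material as back-references into the inherited window, so their outputs are much smaller than the cold baseline; this is the preseed-in-spirit effect, with the same caveat as before that decoding a branch requires replaying the shared prefix. (zlib's `zdict` parameter offers the explicit-dictionary version of the same idea.)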