
Thread: Xz format inadequate for long-term archiving

  1. #1
    Member
    Join Date
    Jul 2013
    Location
    United States
    Posts
    194
    Thanks
    44
    Thanked 140 Times in 69 Posts

    Xz format inadequate for long-term archiving

    This has been making the rounds for the past few days: http://www.nongnu.org/lzip/xz_inadequate.html

    I don't necessarily agree with the author's conclusion, but IMHO they do raise some good points. Definitely worth a read if you want to avoid such mistakes in the future.

  2. The Following 2 Users Say Thank You to nemequ For This Useful Post:

    Bulat Ziganshin (25th October 2016), schnaader (25th October 2016)

  3. #2
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    856
    Thanks
    447
    Thanked 254 Times in 103 Posts
    The author of this page, Antonio Diaz Diaz, is also the author of lzip, which competes directly with xz.

    https://news.ycombinator.com/item?id=12768425
    https://www.reddit.com/r/linux/comme...erm_archiving/

  4. The Following 3 Users Say Thank You to Cyan For This Useful Post:

    Bulat Ziganshin (25th October 2016), encode (20th June 2017), nemequ (24th October 2016)

  5. #3
    Member
    Join Date
    Jul 2013
    Location
    United States
    Posts
    194
    Thanks
    44
    Thanked 140 Times in 69 Posts
    Quote Originally Posted by Cyan
    The author of this page, Antonio Diaz Diaz, is also the author of lzip, which competes directly with xz.

    https://news.ycombinator.com/item?id=12768425
    https://www.reddit.com/r/linux/comme...erm_archiving/
    You're right, I should have mentioned that.

    That said, while that does make his motives somewhat suspect, it also means he has some experience in the area. If we were to discount everything he says just because he is the author of a competing piece of software, I'm not sure there would be much point to this forum… IMHO the best part of this forum is the constructive dialogue between the authors of competing software.

    Like I said, he does make some valid points. AFAICT everything in that article describes either a real mistake (like the checksum not protecting the length field) or something which is debatable (like the degree of extensibility). It's a good read for people considering their own formats.

  6. #4
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    437
    Thanks
    137
    Thanked 152 Times in 100 Posts
    Sure, there are some justified comments in there, but other bits are just nit-picking. E.g. who cares if the format has scope for 2^63 extensions? That's just because the author used int64_t as a data type; it doesn't mean there WILL be that many, simply that they chose a large data type. Really, it's not worth panicking over, and doing so makes the rest of your arguments weaker, as people are more likely to dismiss the entire article as a holy war.

    Variable length integers are just fine too. They don't cause problems (if protected behind checksums) and lead to smaller files. Why wouldn't you want to use such a thing? It can indeed cause framing errors where failure to decode one field also causes the next to fail, but welcome to compression! That's true of almost every stream and this is why checksums matter. (The failure to checksum some of these fields is a big error though.)
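
    For concreteness, here is a minimal sketch of the kind of variable-length integer being discussed (7 payload bits per byte, the high bit as a continuation flag; this is the general scheme, written from scratch rather than taken from the xz sources):

    Code:
    #include <stdint.h>
    #include <stddef.h>

    /* Encode v with 7 payload bits per byte; the high bit is set on every
     * byte except the last. Returns the number of bytes written (max 10). */
    static size_t varint_encode(uint64_t v, uint8_t *out)
    {
        size_t n = 0;
        while (v >= 0x80) {
            out[n++] = (uint8_t)(v & 0x7F) | 0x80;
            v >>= 7;
        }
        out[n++] = (uint8_t)v;
        return n;
    }

    /* Decode; returns bytes consumed, or 0 if the input is truncated or
     * longer than a 64-bit value allows (a framing error). */
    static size_t varint_decode(const uint8_t *in, size_t avail, uint64_t *v)
    {
        uint64_t r = 0;
        for (size_t i = 0; i < avail && i < 10; i++) {
            r |= (uint64_t)(in[i] & 0x7F) << (7 * i);
            if (!(in[i] & 0x80)) {
                *v = r;
                return i + 1;
            }
        }
        return 0;
    }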

    I wasn't entirely sure I understood the trailing data issue. A format with a hard EOF marker can be useful to tell the difference between a truncated file and a genuine EOF. (Sooner or later one of those truncations will happen at the end of a block.) This does make it hard to just append, without writing a tool to strip off the EOF block, but it adds protection.

    I'm also skeptical of the whole "CRC32 is better than CRC64 or SHA-256" thing. However, offering a myriad of checksums with questionable support across different decoders is a valid concern.

    In short: he has a few good points, but they get lost in excessive detail on (IMO) questionable ones, so I doubt most will take it seriously.

  7. The Following User Says Thank You to JamesB For This Useful Post:

    schnaader (25th October 2016)

  8. #5
    Member
    Join Date
    Nov 2015
    Location
    boot ROM
    Posts
    83
    Thanks
    25
    Thanked 15 Times in 13 Posts
    IMHO, anyone serious about long-term storage first needs to understand which failure modes they could face, how to mitigate them, and how far they are willing to go. It is much trickier than just arguing about the xz data format.

    Some things to consider, at a minimum:

    1) If one wants to fight bitrot at the data-format level, the most logical option is FEC (forward error correction) done right. It isn't a panacea, but it works, and adding FEC to thwart bitrot is better than merely counting the damage and hoping it isn't very extensive. Sure, it is nice if a data format is self-syncing and so on, so that at least something can eventually be recovered, but it is much better if it never comes to advanced data recovery and damage assessment in the first place. Btw, I stumbled on the FreeArc archive format description and I really wonder why all the recovery blocks are put at the end of the archive. At first glance that would completely jeopardize the FEC if the file is truncated, no? Since truncation isn't an uncommon failure mode, especially in advanced data recovery (where complete reassembly of fragments can be troublesome), I wonder why it works like this. Not that I use FreeArc or anything, I'm just curious. IIRC most FEC schemes that target extensive damage resilience use interleaving to downgrade intense localized damage (fairly typical for storage and transmission) into low-intensity bit flips spread all over the place, where the FEC can simply correct them. Some things targeting unreliable media go a really long way: a CD-ROM can withstand something like 2.5 mm of totally missing track, IIRC, by doing massive deinterleaving and correcting bunches of errors with strong, multi-layer error-correction codes. But it is complicated, doing it in software at a similar level of paranoia could be a bit painful, and it wouldn't save you from some failure modes anyway. (A small interleaving sketch in this spirit follows after this list.)

    2) Now let's look at the storage devices themselves a bit more. They are funny. A drive can fail completely and catastrophically, in which case neither FEC nor a good data format will help. Drives can fail in ways that leave even the best data recovery lab helpless: nobody can do anything about a head slap and the magnetic layer missing from the platter. It can simply happen that there is nothing left to recover, so it is better to have a backup plan for when something like this happens.

    3) So people may want some RAID-like scheme, for example. Yet storage is tricky. Most RAIDs assume a drive either works or is completely gone, which isn't the case. What about drives that pretend a read succeeded while returning bizarre garbage? Sometimes it can even be malicious: there is fancy malware that patches HDD firmware to make sure the victim doesn't escape, not even after an OS reinstall. Anyone dealing with lots of data eventually figures out that their drives can fail them. If one plans to store large amounts of data on HDDs or SSDs for a while, they had better be prepared for this kind of crap and have some idea how to deal with it.

    4) To make matters worse, there is the filesystem. What if a failed read hits filesystem metadata? Now your file is still there, not even damaged, but hey, it is scattered all over that fancy 5 TiB drive and you can't read it back by the usual means anymore, because the structures describing how to do so were damaged. Sounds fancy, eh? Then you have to learn a lot about your filesystem and the ability of its tools to deal with this. A few bad sectors can make a filesystem unmountable, or even fatally crash the OS driver once it hits invalid metadata (a promising target for fuzzing, btw). How many of you have a plan B ready for when the filesystem fails to mount and you can't get at your data by normal means? It is not all bad: some filesystems like ZFS and Btrfs integrate with the RAID implementation and do checksumming, so when there is RAID the filesystem can retry reads from a different set of blocks on other drives; once the checksum matches, the filesystem has located the right set of blocks. Dumber RAIDs and filesystems, however, may eventually return damaged data, since there is no way to tell which blocks are correct. And even a checksum can occasionally match on incorrect data unless it is long and strong enough (which can also be slow). But OK, those two were designed for storing large amounts of data and have fancy features to prevent corruption and/or recover data; not every filesystem is like that, and even with them it can take some effort.

    5) Even if we assume the RAID hasn't failed, the filesystem worked, and so on, what about the whole server (desktop, ...) facing a catastrophic failure that kills all drives at once, be it a faulty power supply causing a surge, a sudden fire, a virus, a hacker or something else? Backups are a good idea, after all, and they had better be somewhat "offline", in the sense that they aren't attached to the same computer systems all the time.

    6) Shit can happen even at the data-center level; there can be a disaster, after all. So if one keeps plenty of valuable data, one should really consider storing it several times in physically different locations. And once one goes that far, even unreliable drives and filesystems can be used, and the failure of a whole server isn't a big deal: as long as there are redundant nodes and a reasonable integrity check on the blocks, the data stays available and undamaged, provided someone is willing to invest the resources to keep the whole thing running and the FEC/redundancy can do its job using the remaining storage nodes.
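
    To illustrate the interleaving idea from point 1 (a toy sketch, not any particular format): splitting the stream into a fixed number of rows and writing it out column by column turns a contiguous burst of up to "rows" damaged bytes into at most one damaged byte per row, which a per-row code can then correct.

    Code:
    #include <stddef.h>

    /* Block interleaver: fill a rows x cols matrix row by row, emit it
     * column by column. A burst of up to `rows` consecutive bad bytes in
     * the interleaved stream then hits each row at most once. */
    static void interleave(const unsigned char *in, unsigned char *out,
                           size_t rows, size_t cols)
    {
        for (size_t r = 0; r < rows; r++)
            for (size_t c = 0; c < cols; c++)
                out[c * rows + r] = in[r * cols + c];
    }

    /* Inverse permutation, used by the reader before decoding. */
    static void deinterleave(const unsigned char *in, unsigned char *out,
                             size_t rows, size_t cols)
    {
        for (size_t r = 0; r < rows; r++)
            for (size_t c = 0; c < cols; c++)
                out[r * cols + c] = in[c * rows + r];
    }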

    What this author wrote is somewhat valid, yet it focuses on a few corner cases which may or may not be worth considering. It depends heavily on which scenarios one considers, which technologies are in use, how many resources one is willing to invest into fault tolerance, and so on. Btw, most modern storage isn't meant for long-term archiving. Modern triple-level cell SSDs, say, won't hold data very long: the charge dissipates over time on its own. HDDs are better, since the magnetic field does not dissipate, but their lifetime isn't infinite either, not even in a powered-off state (their onboard CPU boots from flash, for instance). There are more exotic options like tape, but they are relatively exotic, costly and do not provide random access. In the long run it also comes down to the lifetime of the drive and whether you'll still be able to get a working drive, say, 30 years later.

  9. The Following User Says Thank You to xcrh For This Useful Post:

    Bulat Ziganshin (4th November 2016)

  10. #6
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    437
    Thanks
    137
    Thanked 152 Times in 100 Posts
    Stone tablets have been demonstrated to have good data retention, as has vellum.

    Edit: but see Linear A for counter examples...

  11. #7
    Programmer Bulat Ziganshin
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    Quote Originally Posted by xcrh
    Btw, I stumbled on the FreeArc archive format description and I really wonder why all the recovery blocks are put at the end of the archive. At first glance that would completely jeopardize the FEC if the file is truncated, no? Since truncation isn't an uncommon failure mode, especially in advanced data recovery (where complete reassembly of fragments can be troublesome), I wonder why it works like this. Not that I use FreeArc or anything, I'm just curious. IIRC most FEC schemes that target extensive damage resilience use interleaving to downgrade intense localized damage (fairly typical for storage and transmission) into low-intensity bit flips spread all over the place, where the FEC can simply correct them. Some things targeting unreliable media go a really long way: a CD-ROM can withstand something like 2.5 mm of totally missing track, IIRC, by doing massive deinterleaving and correcting bunches of errors with strong, multi-layer error-correction codes. But it is complicated, doing it in software at a similar level of paranoia could be a bit painful, and it wouldn't save you from some failure modes anyway.
    Big thanks for this comment; it gave me the inspiration for the last detail required for the FA FEC format. So let's see how it could work.


    1. Modern media lose data in whole blocks, which (depending on the medium) may be anything from 512 bytes to hundreds of KB. So the right approach to FEC is to protect entire blocks: if any byte in a block is lost, the entire block is considered lost. I call the minimal entity protected by such a program a "sector"; it's not necessarily 512 bytes, e.g. I've used 4 KB sectors in my research program.

    So we split the entire dataset into N data sectors and compute M ECC sectors, and the data can be recovered from any N surviving sectors. It makes no difference whether the ECC sectors go after the data sectors or are interleaved with them.

    Of course, with large N we would need too much memory to work with all sectors simultaneously, but that can be solved by limiting N to ~64K and pseudo-randomly interleaving independently protected sector sets (cohorts).
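
    A minimal sketch of that sector model with the simplest possible code, M = 1 (a single XOR parity sector, so any one lost sector out of N+1 can be rebuilt); a real recovery record would use Reed-Solomon or similar to survive M > 1 losses, but the framing is the same:

    Code:
    #include <string.h>
    #include <stddef.h>

    #define SECTOR 4096                 /* sector size used in this sketch */

    /* Compute one parity sector over n data sectors (the M = 1 case). */
    static void ecc_build(unsigned char data[][SECTOR], size_t n,
                          unsigned char parity[SECTOR])
    {
        memset(parity, 0, SECTOR);
        for (size_t i = 0; i < n; i++)
            for (size_t b = 0; b < SECTOR; b++)
                parity[b] ^= data[i][b];
    }

    /* Rebuild the single lost data sector 'lost' from the survivors. */
    static void ecc_recover(unsigned char data[][SECTOR], size_t n,
                            const unsigned char parity[SECTOR], size_t lost)
    {
        memcpy(data[lost], parity, SECTOR);
        for (size_t i = 0; i < n; i++)
            if (i != lost)
                for (size_t b = 0; b < SECTOR; b++)
                    data[lost][b] ^= data[i][b];
    }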


    2. So far, storing the ECC data at the end of the archive is perfectly OK. It even has some advantages, allowing recovery data to be quickly added to a usual archive and stripped off again. In particular, the existing FreeArc format just creates the full archive including the archive directory, then optionally adds a recovery record covering everything up to that point, including the archive directory. If we lose some bytes at the end of the archive, we only lose part of the recovery record, while the archive itself stays fine.

    Actually, losing the archive tail is the best-case scenario: if we lose up to M sectors, we don't even need to calculate any recovered sectors, and if we lose more than M sectors, full recovery is impossible anyway, yet we still lose the minimal number of data sectors and still don't need to calculate any recovered sectors.

    Unfortunately, this great idea is not well implemented. In particular, FreeArc can't search through the entire archive looking for directory records, but that is only a matter of adding some code to FreeArc.

    A more important problem, though, is that neither the recovery record nor the directory record is self-recovering. Losing a single byte in the record header is enough to lose the entire record. Moreover, since all records are checked by their CRC, losing any byte at all loses the entire record (and if we disable CRC checking, we cannot tell which part of the record is correct and which part is just junk).

    So the real problem isn't the regular data (source and ECC sectors), but the metainformation.


    3. The existing FreeArc RR can be ruined by just two deliberate "shots": changing the first byte of the directory record and the first byte of the recovery record is enough to kill an entire archive with an RR of any size. What we need is to distribute the directory and recovery metainfo records across the entire archive and protect them with the same ECC technique.

    Let's start with the directory record. It contains info about one or more solid blocks, including the file list for each solid block. In order to increase the chances of recovery, we may split the directory record into smaller independent records, down to a single solid block per record.

    The recovery metainfo mainly consists of checksums for each source/ECC sector. It can be split into records of any convenient size.

    What we really need is to make each record self-describing, so that useful information can be extracted from the record alone, and to ECC-protect every record. Moreover, since the metainfo is much smaller and more important than the ordinary data, we may prefer to boost its recovery chances; e.g. if the user requested adding 1% of recovery info to the archive, we can protect the metainfo with 10% redundancy.


    4. A self-describing directory record only needs to know its own position in the archive, so that it can map its solid blocks to archive sectors.

    Self-describing recovery metainfo should contain the FEC geometry (i.e. the values of N, M and the cohort size) and define the range of sectors whose checksums are stored in it.

    It may be a good idea to combine a directory record and the checksums for the data sectors of its solid blocks into a single record.


    5. So now we have some data+ECC sectors. The ECC sectors can be stored at the end of the archive or in separate file(s); examples of the latter are RAR recovery volumes and the PAR2 format.

    We also have a set of recovery metainfo records, whose size is fully controlled by placing more or fewer checksums into each record, plus a number of directory records, whose size is partially controlled by combining more or fewer solid blocks into a single directory record.

    Our goal is to add some ECC info to these records and then distribute all this meta + meta-ECC info throughout the entire archive to maximize the chances of recovery.


    6. Let's start with the simple case and ignore directory records for a moment. We can split the recovery metainfo into N1 equal-sized records and compute M1 meta-ECC records of the same size, or just inflate each recovery record with meta-ECC info. In order to distribute these records evenly through the archive, we need to reserve archive sectors on a regular basis, e.g. skip 1 sector after each K sectors written to the archive. We can tune the number of sectors each RR describes so that an RR fits exactly into the sector size.
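
    Just to pin down that layout, a tiny sketch of the index arithmetic (nothing FreeArc-specific): with one sector reserved after every K payload sectors, writer and reader only need these two formulas to agree on where everything lives.

    Code:
    #include <stdint.h>

    /* Archive position of the i-th payload (data or ECC) sector when one
     * sector is reserved for metainfo after every K payload sectors. */
    static uint64_t payload_pos(uint64_t i, uint64_t k)
    {
        return i + i / k;               /* i/k reserved slots precede it */
    }

    /* Archive position of the j-th reserved (metainfo) sector. */
    static uint64_t reserved_pos(uint64_t j, uint64_t k)
    {
        return (j + 1) * (k + 1) - 1;   /* right after each group of K */
    }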

    The reserved sectors may be used exclusively for metainfo, with the ECC sectors placed at the end of the archive, or both the recovery records and the ECC sectors may be distributed throughout the entire archive.

    The interesting point, though, is that we may prefer to store the geometry/ID/checksum with every data sector and/or every ECC sector. This removes the need to ECC-protect these checksums, i.e. we will usually lose both a sector and its checksum, or keep both. Essentially, it makes each sector self-described and ready to use without any other metadata.
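
    One possible shape for such a self-described sector (the field names and sizes are my own guesses, not an actual FreeArc structure): each sector carries its geometry, its position and a checksum of its own payload, so any sector that survives is usable without external metainfo.

    Code:
    #include <stdint.h>

    /* Hypothetical per-sector header. Everything needed to classify the
     * sector is stored with it, so losing other metainfo does not make a
     * surviving sector useless. */
    struct sector_hdr {
        uint32_t magic;        /* format marker / record type               */
        uint32_t cohort;       /* which independently protected set (N, M)  */
        uint32_t index;        /* position of this sector inside its cohort */
        uint16_t n_data;       /* N: data sectors per cohort                */
        uint16_t n_ecc;        /* M: ECC sectors per cohort                 */
        uint32_t payload_len;  /* payload bytes that follow the header      */
        uint32_t payload_crc;  /* CRC-32 of the payload only                */
        uint32_t header_crc;   /* CRC-32 of the fields above                */
    };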


    7. For simplicity of implementation, we can protect directory records just like ordinary sectors. I think that even if the directory is placed at the end of the archive, followed by the ECC sectors, it will not lower the chances of recovery (compared to interleaved data/ECC sectors) as long as the ECC metainfo is distributed over the entire archive.


    8. Finally, let's look at the protection of directory records. They contain very important info and tend to be very small. So we can raise their redundancy ratio ~10x (as proposed above), or allocate a fixed part of the total ECC space (1-10%) for directory protection, or use a smarter formula such as dataECC/directoryECC = sqrt(datasize/dirsize). This means that for many archives the directory ECC redundancy will be more than 100%, which makes it important to distribute the ECC sectors for directory records over the entire archive.
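
    Reading dataECC and directoryECC in that formula as absolute ECC sizes rather than ratios (my interpretation), here is a small worked example with made-up numbers showing how the directory redundancy ends up far above the data redundancy:

    Code:
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        /* Made-up example archive: 100 GiB of data, a 1 MiB directory,
         * and the user asked for 1% of recovery info overall. */
        double datasize = 100.0 * 1024 * 1024 * 1024;
        double dirsize  = 1.0 * 1024 * 1024;
        double data_redundancy = 0.01;

        /* dataECC / dirECC = sqrt(datasize / dirsize), taken as sizes */
        double data_ecc = datasize * data_redundancy;
        double dir_ecc  = data_ecc / sqrt(datasize / dirsize);

        printf("directory ECC: %.0f KiB (%.0f%% redundancy)\n",
               dir_ecc / 1024, 100.0 * dir_ecc / dirsize);
        /* prints about 3277 KiB, i.e. roughly 320% redundancy: >100% */
        return 0;
    }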

    The small size and high redundancy ratio mean that the sector size for directory records may be lowered to 128-512 bytes. We can compute the optimal size for each archive based on other parameters, such as the overall directory size, the number of directory records and the directory redundancy ratio.

    Since directory records may have arbitrary sizes over which we have only limited control (unlike recovery records), we can't ensure that each directory record fills a whole number of sectors. Instead, we may start a new sector with each new record, or pack them, or start a new sector only if an entire record can't be packed into the rest of the last sector, or even duplicate sectors containing data from multiple directory records.

    Each directory record should store enough info to map it to a "directory data sector" range. The archive will then contain directory records treated as "directory data sectors" plus directory ECC sectors distributed over the entire archive. I think there is not much need to distribute the directory records themselves over the archive: if they are large enough, they will be somewhat distributed anyway, and if they are small enough, then the directory ECC redundancy will be >100% and it will suffice that the directory ECC sectors are distributed.



    PS: Please share your vision of how ideal archive recovery should look, in particular how to store the ECC data if not at the archive end, and/or point me to existing proposals/implementations, and criticize my ideas.
    Last edited by Bulat Ziganshin; 4th November 2016 at 17:14.

  12. #8
    Member
    Join Date
    Jun 2017
    Location
    Moscow
    Posts
    1
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Hi all!

    Sorry for bringing up a stale conversation; here are some ideas about long-term archiving and recovery:

    1) Error models for typical use cases should be determined, and a test utility (in the spirit of SMHasher) should be written to verify recovery properties against those error models (a small error-injection sketch along those lines is at the end of this post).

    There is some interesting NAND flash error-pattern analysis here: https://pdfs.semanticscholar.org/5a0...e13050a274.pdf.
    NAND flash is now used almost everywhere (mobile phones/tablets, SSDs, USB drives), but it is hidden under lots of abstraction layers, so only errors that leak through all those layers can be seen at the file-format level.

    Usually an entire page (1K-4K) or an entire block (128K-512K) is corrupted, and the number of flipped bits is high enough to get past all the hardware ECC.

    Similar data can be found for magnetic tapes, DVD/BluRay, HDD and other media.

    2) Introducing some redundancy into the file format (like replicating and distributing copies of directory entries) increases overall robustness, but without knowing the exact error model it can lead to unnecessary bloat. This can be solved by offering some pre-configured redundancy settings (e.g. NAND flash, magnetic tape, HDD) so users can choose what is best for them.

    3) Following the Unix way, a FEC codec can be implemented as a separate program which takes any file (compressed or not) and adds FEC/interleaving/redundancy to it. There is already such a program, vdmfec. This way, there can be several implementations of FEC codecs, fine-tuned to present and future error models.
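
    A sketch of the kind of test harness point 1 asks for (the file name, block size and counts below are arbitrary): inject the whole-page/whole-block error pattern described above into a scratch copy of an archive, then check whether the format or tool under test can still recover it.

    Code:
    #include <stdio.h>
    #include <stdlib.h>

    /* Overwrite nblocks randomly chosen, block-aligned regions of path
     * with random bytes, mimicking NAND failures where an entire page or
     * block goes bad at once. Run the recovery tool on the damaged copy
     * afterwards to see how it copes with this error model. */
    static int inject_block_errors(const char *path, long block, int nblocks)
    {
        FILE *f = fopen(path, "r+b");
        if (!f) return -1;
        fseek(f, 0, SEEK_END);
        long size = ftell(f);
        if (size < block) { fclose(f); return -1; }
        for (int i = 0; i < nblocks; i++) {
            long off = (rand() % (size / block)) * block;  /* block-aligned */
            fseek(f, off, SEEK_SET);
            for (long b = 0; b < block && off + b < size; b++)
                fputc(rand() & 0xFF, f);                   /* heavy damage  */
        }
        fclose(f);
        return 0;
    }

    int main(void)
    {
        srand(12345);          /* fixed seed so damage runs are repeatable */
        /* e.g. destroy three 128 KiB blocks in a scratch copy of an archive */
        return inject_block_errors("archive.test.copy", 128 * 1024, 3) ? 1 : 0;
    }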

  13. #9
    Member
    Join Date
    Jun 2017
    Location
    Brussels
    Posts
    1
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Damir
    There is some interesting NAND flash error-pattern analysis here: https://pdfs.semanticscholar.org/5a0...e13050a274.pdf.
    NAND flash is now used almost everywhere (mobile phones/tablets, SSDs, USB drives), but it is hidden under lots of abstraction layers, so only errors that leak through all those layers can be seen at the file-format level.

    Usually an entire page (1K-4K) or an entire block (128K-512K) is corrupted, and the number of flipped bits is high enough to get past all the hardware ECC.
    Hi,

    This reminds me that lziprecover can already fix errors from NAND flash devices, provided more than one copy of the file was made.

    http://www.nongnu.org/lzip/manual/lz...#Merging-files

    "Here is a real case of successful merging. Two copies of the file 'icecat-3.5.3-x86.tar.lz' (compressed size 9 MB) became corrupt while stored on the same NAND flash device. One of the copies had 76 single-bit errors scattered in an area of 1020 bytes, and the other had 3028 such errors in an area of 31729 bytes. Lziprecover produced a correct file, identical to the original, in just 5 seconds:"

