Page 1 of 3
Results 1 to 30 of 81

Thread: tar replacement for Cyan

  1. #1
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts

    tar replacement for Cyan

    http://nishi.dreamhosters.com/u/shar_v2.rar

    Code:
    Shar version 02 [03.11.2011 12:41]. File archiving utility.
    Copyright (c) 2011 ConeXware, Inc.  All Rights Reserved.
    
    Usage:
    
      shar a archive path -- concatenate files/subdirs at path into "archive"
      shar a archive path1 path2 path3...
      shar a archive @lst -- add from paths listed in a file "lst"
      shar a archive @-   -- add from paths supplied from stdin
      shar a archive path1 @- path2
      shar a - path       -- output archive data to stdout
      (archive is appended if it already exists)
    
      shar x archive base -- extract files/dirs from archive to base\
      shar x archive      -- extract to current directory
      shar x - base       -- extract from stdin to base\
    
      shar l archive      -- list the contents of an archive
      shar l -            -- list from stdin
    
    Notes:
    
    1. Before exit, shar prints a message "Result: xxx" to stderr.
    It's the OS message corresponding to the last encountered error code.
    
    2. Due to Windows quirks, "path" _partially_ supports wildcards (?*).
    E.g. "shar a 1 *.exe" would work right, but won't include any folders (except those matching *.exe).
    Also, "shar a 1 .." or "shar a 1 ." would include the actual name of the base folder.
    In the same way, "shar a 1 folder" would store paths including the specified folder, while
    "shar a 1 folder\*" would not.
    
    3. File/dir attributes, streams, security info, timestamps are not preserved.
    Links/junctions are not preserved (added as what they point to).
    
    4. List files for "a" command are presumed to contain utf8 paths.

  2. #2
    Member kampaster's Avatar
    Join Date
    Apr 2010
    Location
    ->
    Posts
    55
    Thanks
    4
    Thanked 6 Times in 6 Posts
    Shelwien
    Thanks!

  3. #3
    Member
    Join Date
    May 2008
    Location
    HK
    Posts
    160
    Thanks
    4
    Thanked 25 Times in 15 Posts
    /me remembers sharutils.
    http://www.gnu.org/s/sharutils/

  4. #4
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    I know, but ash, par, paf seemed more relevant, so...

  5. #5
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    Code:
    // Converts UTF-8 string to UTF-16
    WCHAR *utf8_to_utf16 (const char *utf8, WCHAR *_utf16)
    {
      WCHAR *utf16 = _utf16;
      do {
        BYTE c = utf8[0];   UINT c32;
             if (c<=0x7F)   c32 = c;
        else if (c<=0xBF)   c32 = '?';
        else if (c<=0xDF)   c32 = ((c&0x1F) << 6) +  (utf8[1]&0x3F),  utf8++;
        else if (c<=0xEF)   c32 = ((c&0x0F) <<12) + ((utf8[1]&0x3F) << 6) +  (utf8[2]&0x3F),  utf8+=2;
        else                c32 = ((c&0x0F) <<18) + ((utf8[1]&0x3F) <<12) + ((utf8[2]&0x3F) << 6) + (utf8[3]&0x3F),  utf8+=3;
    
        // Now c32 represents full 32-bit Unicode char
        if (c32 <= 0xFFFF)  *utf16++ = c32;
        else                c32-=0x10000, *utf16++ = c32/0x400 + 0xd800, *utf16++ = c32%0x400 + 0xdc00;
    
      } while (*utf8++);
      return _utf16;
    }
    
    // Converts UTF-16 string to UTF-8
    char *utf16_to_utf8 (const WCHAR *utf16, char *_utf8)
    {
      char *utf8 = _utf8;
      do {
        UINT c = utf16[0];
        if (0xd800<=c && c<=0xdbff && 0xdc00<=utf16[1] && utf16[1]<=0xdfff)
          c = (c - 0xd800)*0x400 + (UINT)(*++utf16 - 0xdc00) + 0x10000;
    
        // Now c represents full 32-bit Unicode char
             if (c<=0x7F)   *utf8++ = c;
        else if (c<=0x07FF) *utf8++ = 0xC0|(c>> 6)&0x1F,  *utf8++ = 0x80|(c>> 0)&0x3F;
        else if (c<=0xFFFF) *utf8++ = 0xE0|(c>>12)&0x0F,  *utf8++ = 0x80|(c>> 6)&0x3F,  *utf8++ = 0x80|(c>> 0)&0x3F;
        else                *utf8++ = 0xF0|(c>>18)&0x0F,  *utf8++ = 0x80|(c>>12)&0x3F,  *utf8++ = 0x80|(c>> 6)&0x3F,  *utf8++ = 0x80|(c>> 0)&0x3F;
    
      } while (*utf16++);
      return _utf8;
    }

  6. #6
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    Just for fun - many years ago I implemented the smallest archiver, using Haskell:

    Code:
    import Data.Char
    import System.Environment
    import System.IO.Unsafe
    
    encodeInt n    = map chr [n `mod` 256, (n `div` 256) `mod` 256, n `div` 65536]
    encodeStr x    = encodeInt(length x) ++ x
    encodeFile f   = encodeStr(f) ++ encodeStr(unsafePerformIO (readFile f))
    main = do (arc:files) <- getArgs; writeFile arc (concatMap encodeFile files)
    usage: archiver.exe archive files...

  7. #7
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    856
    Thanks
    447
    Thanked 254 Times in 103 Posts

    Thumbs up

    Thanks very much Shelwien !
    I will look into it. I'm a bit busy during this weekend, so expect some feedback by tomorrow.

    Regards

  8. #8
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    Eugene, what is the code license?
    Last edited by Bulat Ziganshin; 29th October 2011 at 15:30.

  9. #9
    Member
    Join Date
    Oct 2011
    Location
    Leeds, West Yorkshire. UK.
    Posts
    2
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Hi all, my first-time post. I'm at risk here of 'teaching Grandma to suck eggs', but while I've never used tar, I do concatenate files and folders prior to creating the final archive. Usually the tool I use is the archiver of choice, told to store and recurse (if required); then I run the archiver of choice with my chosen parameters/settings. It has never failed me yet.

    As a useful side-effect, you only need the one piece of software to decode and restore the contents. HFCB might benefit from this method.

    Just an old DOS guy, but perhaps I'm on the right track. Recycle! Reduce. Rejoice!

    Best regards.
    Ste.

  10. #10
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    @Bulat:
    1. Thanks for the utf8 functions, I'll use them as a reference.
    But what I'd like to see is a real text or filename
    which pushes the limits of utf8.
    2. (haskell archiver) Isn't that limited to <16M files?
    3. (license) Added one.

    @Veggieste:
    There are lots of experimental compressors which only support
    file-to-file compression, without archiving.
    But the choice of archiving tools that could be used with these
    compressors is very limited - rar or 7z basically - others
    would have problems with unicode filenames or something else
    (eg. \\.\C:\boot.ini).
    And then, distributing rar or 7z with your own compressor is a little
    troublesome.

  11. #11
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    1. You can use charmap.exe to generate some char with code >16384 and then rename a file in Explorer to this name. Alternatively, you can generate a file with such a name from a program and then look at it in Explorer. In particular, Chinese has a LOT of chars.

  12. #12
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    856
    Thanks
    447
    Thanked 254 Times in 103 Posts
    A few words of feedback :

    1) The archiver works as intended. It also passes the "cyrillic alphabet" checkpoint.
    It is small (only 7K!) and efficient, so kudos, it's the way to go.
    I will use this replacement over tar without hesitation.

    2) Reading the included documentation :
    shar_license : Copyright (c) 2011 ConeXware, Inc. Is that correct ?
    shar_history : maybe some dates are mixed-up ?
    Are the listed limitations/bugs applicable to latest shar version ?

    3) Potential improvements :
    - Listing files is good for debug. In release mode, it should be kept as a -verbose option.
    - Overwrite situation should be detected, and require some form of confirmation (either implicit through an option, or user-confirmed).
    If you don't have time or are not interested in those suggestions, don't bother: since you already provided the source code, I can at least make an effort to read and modify it.

    4) I still have a bug, sometimes, when using a pipe between shar and LZ4. I will need some time to understand what's wrong, since each program runs fine alone with a pipe redirection. Expect some feedback when that's done.

    Very nice work Shelwien !
    Last edited by Cyan; 31st October 2011 at 02:43.

  13. #13
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    > 1) The archiver works as intended. It also passes the "cyrillic alphabet" checkpoint.

    Yes, but there are expected problems with "traditional chinese" or something like that,
    due to limited utf32 support.
    One possible workaround is to use conversion routines provided by Bulat in this thread,
    but I'd try to improve my own ones later.

    > It is small (only 7K!) and efficient,

    That's the VC6 version. gcc/mingw produces 10-11k exes.

    > I will use this replacement over tar without hesitation.

    Good :)

    > 2) Reading the included documentation :
    > shar_license : Copyright (c) 2011 ConeXware, Inc. Is that correct ?

    Yes. The scanner library there is a part of "paf" and http://patchbeam.com/ .
    But it's still open source, so I don't see any problems for you there.

    > shar_history : maybe some dates are mixed-up ?

    Yes, thanks :)

    > Are the listed limitations/bugs applicable to latest shar version ?

    Yes.
    In [2], the handling of the base folder seems like a reasonable "feature",
    but for full wildcard support it's necessary to write a corresponding
    filter function (trying to use OS functions for that would be very inefficient -
    something like scanning each directory twice, etc).
    As to [3], it's quite a bit of work, and not really compatible with an
    archive format based on per-file headers - for example, it's only possible
    to set directory timestamps after finishing writing all files/dirs in it,
    so with the current shar archive format we'd have to reconstruct the directory
    tree in memory.
    In comparison, formats with a single solid index (like .7z) are more efficient
    for compression and easier to process, but unfortunately totally incompatible
    with pipe i/o - we can't put the index at the start of the archive, because it's only
    fully known at the end of archive creation, and we can't extract the archive
    from a stream when it doesn't start with the index.
    As to utf8 ([4]), it's possible to use Bulat's functions or even winapi (slow) for that,
    but I'm not in the mood for fixing that atm - please send me an example where it fails,
    if you really want me to fix that :)

    > 3) Potential improvements:
    > - Listing files is good for debug. In release mode, it should be kept as a -verbose option.

    This reminded me to add a command for listing archives (see v1a).
    But I don't agree about -verbose, because it's easy to hide these messages - did you see test.bat?
    Code:
    shar a - ??? 2>nul | shar x - temp\ 2>nul
    > - Overwrite situation should be detected, and require some form of
    > confirmation (either implicit through an option, or user-confirmed).

    Not sure about that; personally I am very annoyed when a program
    stops working and starts asking questions.
    Of course, it's very easy to add, but as a non-default option it
    won't ever be used, and as a default option it would bother me.
    It's also actually inefficient - to implement it, we have to first
    try opening each file/dir in "read" mode, and only create it if it
    doesn't exist - lots of duplicate work.
    How about instead a test mode which would set "errorlevel"
    if overwrites would be required on extraction?

    > If you don't have time or are not interested in those suggestions, don't bother

    Well, I'm interested in discussing them, at least.

    > since you already provided the source code, i can at least make an
    > effort to read and modify it.

    That sounds like a good idea too :)

    > 4) I still have a bug, sometimes, when using a pipe between shar and
    > LZ4. I will need some time to understand what's wrong, since each
    > program runs fine alone with a pipe redirection. Expect some
    > feedback when that's done.

    It's a little tricky. Plain file redirection in Windows is the same
    as opening the file directly - i.e. seek etc. is available, and you
    basically have full access to the file.

    With a pipe, the main difference is the fact that a sync file read
    expected to read N bytes can return <N as the result without that being
    the end of file.

    I guess this means that it could be a bug in shar too, as it reads header
    data once without checks, but I'd appreciate it if you could test that.

  14. #14
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    856
    Thanks
    447
    Thanked 254 Times in 103 Posts
    Thanks, the pipe issue seems solved.
    As you mentioned, Shelwien, the "blocking" read function does indeed sometimes return without fully filling the provided buffer.
    Basically, it provides "what is available" in the pipe.
    Therefore, in a situation where the consumer is faster than the producer, the read buffer is not filled completely.

    This is fixed in the version provided in this link :
    http://sd-1.archive-host.com/membres...v12d-alpha.exe

    I'll keep it in "alpha stage" for the moment, while doing some more tests.
    But it looks promising. Even very large directories with tons of files and weird names work flawlessly. It seems to work like a charm.

  15. #15
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    http://nishi.dreamhosters.com/u/shar_v1b.rar
    Code:
    v1a  31-10-2011 http://nishi.dreamhosters.com/u/shar_v1a.rar
     + "l" command added for listing archive contents
    
    v1b  31-10-2011 http://nishi.dreamhosters.com/u/shar_v1b.rar
     + ReadFile wrapper to wait and read exactly N bytes from streams
     + Original unpreprocessed source included, for a change
    
    todo:
     - proper utf32 support
     - proper wildcard support
     - commandline parsing (with -q (quiet) as example)
     - an option to control file overwrites (overwrite/skip/ask)
    1. Please tell me which form of source you prefer - this is how it looks originally.
    The single shar.cpp is made by passing it through the C++ preprocessor.

    2. I ended up fixing the ReadFile thing in shar too, and it looks like it really was a bug,
    because before I also had a problem with the shar | plzma test on a big folder, but
    blamed plzma for it - and now the same test passed without any problems.

  16. #16
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    Btw, I also made this thing once upon a time - http://nishi.dreamhosters.com/u/down.exe
    It's a web installer or something like that. I can post the source too - wanna use it?
    Also you can use http://nishi.dreamhosters.com:8888/ for file storage.

  17. #17
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    856
    Thanks
    447
    Thanked 254 Times in 103 Posts
    > Please tell me which form of source you prefer

    Well, I'm not sure.
    I find your way of writing sources more "readable",
    but unfortunately, IDEs do not see it that way.
    Parsing, indentation, underlining, warnings, and so on, get confused.

    So, for customisations, the "single source file" version is preferable, since it will be easier to modify.
    And for reading calmly, I like your own way, with *.inc files.


    > you can use http://nishi.dreamhosters.com:8888/ for file storage.

    Thanks. I've uploaded the latest alpha there:
    http://nishi.dreamhosters.com/v/LZ4_...1.2d-alpha.exe


    > Its a web installer or something like that. I can post the source too,

    Sounds like it could be useful for updates.
    It's a bit soon for me to make good use of it today, but it will certainly become useful for one of the next releases.

  18. #18
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    > I find your way of writing sources more "readable",
    > but unfortunately, IDEs do not see it that way.

    If it's only about the IDE not seeing .inc as source files, that's usually solved by adding .inc to the IDE config.

    > And for reading calmly, I like your own way, with *.inc files.

    Then maybe you can also try my IDE replacement:
    1. Install Far - http://farmanager.com/download.php?l=en
    2. Install the Colorer plugin (http://colorer.sourceforge.net/farplugin.html)
    Download http://kent.dl.sourceforge.net/proje...ar2_1.0.3.4.7z
    Extract into /Far/Plugins/
    3. Start Far, navigate to source (.cpp/.inc), press F4

  19. #19
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,612
    Thanks
    30
    Thanked 65 Times in 47 Posts
    Quote Originally Posted by Shelwien View Post
    In comparison, formats with a single solid index (like .7z) are more efficient
    for compression and easier to process, but unfortunately totally incompatible
    with pipe i/o - we can't put the index at the start of the archive, because it's only
    fully known at the end of archive creation, and we can't extract the archive
    from a stream when it doesn't start with the index.
    You could interleave 2 separately compressed streams, one with data and the other with metadata. It seems more elegant than dumping them together, and while I'm not really sure that it would be more space efficient, it at least makes operations on metadata (like listing contents) much faster.

    Quote Originally Posted by Shelwien View Post
    As to utf8 ([4]), its possible to use Bulat's functions or even winapi (slow) for that,
    but I'm not in the mood for fixing that atm - please send me an example where it fails,
    if you really want me to fix that
    I don't ask you to fix that ATM, but have a question: In what way does it fail when it finds unsupported character? Does it refuse to compress it, crash, silently produce incorrect archive or else?

  20. #20
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    > You could interleave 2 separately compressed streams, one with data
    > and the other with metadata.

    Yes, but such interleaving would only improve compression in pipe mode,
    while making the format parser considerably more complicated.
    If pipe support is the only requirement, it would be much easier to
    just build the index first, write it, and then start compressing the files.
    This way it also makes sense for archivers that do file sorting
    before processing.

    But directory scanning with lots of small files can be considerably slow too,
    so I prefer to do it in parallel with compression.

    > It seems more elegant than dumping them together, and while I'm not
    > really sure that it would be more space efficient, at least it makes
    > operations on metadata (like listing contents) much faster.

    In that sense, there's nothing better than a single global index -
    otherwise you'd practically have to read the whole archive.

    A global index at the end of the archive is also good for archive updating -
    we can append in place, overwriting the old index and adding a new one.

    And anyway the pipe mode in archiver is only useful when some other
    program does the actual compression, which is pretty rare.

    > I don't ask you to fix that ATM, but have a question: In what way
    > does it fail when it finds unsupported character? Does it refuse to
    > compress it, crash, silently produce incorrect archive or else?

    It would mix up U+1xxxx symbols in the filenames stored in the archive.
    So aside from incorrect chinese filenames I don't expect any problems.
    The version with proper utf32 support won't even become incompatible
    with this one.

  21. #21
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,612
    Thanks
    30
    Thanked 65 Times in 47 Posts
    Quote Originally Posted by Shelwien View Post
    If pipe support is the only requirement, it would be much easier to
    just build the index first, write it, and then start compressing the files.
    This way it also makes sense for archivers that do file sorting
    before processing.
    No. Because somebody might dump a trillion tiny files at you, you need some limit on the initial block size.
    The good thing about doing it my way is that it's very flexible. Actually, I never said that it should be 1filemetadata-1filedata-1filemetadata-1filedata-... but rather ~nfilesmetadata-~nfilesdata-... with n being something reasonable. At first I thought about n being 'several', but now I don't really see a problem with 1000. In many cases it would let you discard a metadata dictionary before starting to decompress files. Also, you can have an index at the beginning, but when the user asks you to add some files to the archive, it's just a matter of appending the next piece. Updating existing files is hard anyway; I think it would be good if archives included some patching engine to allow much faster but less space-efficient updates, but that's complex and breaks streaming.

    Quote Originally Posted by Shelwien View Post
    But directory scanning with lots of small files can be considerably slow too,
    so I prefer to do it in parallel with compression.
    Never thought about it. I guess it's good for fast archivers.

    Quote Originally Posted by Shelwien View Post
    > It seems more elegant than dumping them together, and while I'm not
    > really sure that it would be more space efficient, at least it makes
    > operations on metadata (like listing contents) much faster.

    In that sense, there's nothing better than a single global index -
    otherwise you'd practically have to read the whole archive.
    Yeah, but like I said, it's inherently incompatible with streaming.

    Quote Originally Posted by Shelwien View Post
    And anyway the pipe mode in archiver is only useful when some other
    program does the actual compression, which is pretty rare.
    I also don't understand why so many people ask for it, though there's another advantage: it's more secure. A program that pushes data between pipes doesn't need rights like scanning directory contents.

    Quote Originally Posted by Shelwien View Post
    > I don't ask you to fix that ATM, but have a question: In what way
    > does it fail when it finds unsupported character? Does it refuse to
    > compress it, crash, silently produce incorrect archive or else?

    It would mix up U+1xxxx symbols in the filenames stored in the archive.
    So aside from incorrect chinese filenames I don't expect any problems.
    The version with proper utf32 support won't even become incompatible
    with this one.
    Well, silent data corruption is the worst way to fail, IMO. But it's not as if it matters to me in this case.
    Last edited by m^2; 31st October 2011 at 14:47.

  22. #22
    Member
    Join Date
    May 2008
    Location
    HK
    Posts
    160
    Thanks
    4
    Thanked 25 Times in 15 Posts
    Quote Originally Posted by Shelwien View Post
    > I don't ask you to fix that ATM, but have a question: In what way
    > does it fail when it finds unsupported character? Does it refuse to
    > compress it, crash, silently produce incorrect archive or else?

    It would mix up U+1xxxx symbols in the filenames stored in the archive.
    So aside from incorrect chinese filenames I don't expect any problems.
    The version with proper utf32 support won't even become incompatible
    with this one.
    I wonder if Windows will just feed you UTF-16 surrogate pairs instead.

  23. #23
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    856
    Thanks
    447
    Thanked 254 Times in 103 Posts
    Two quick comments after some more testing:

    - Would it be possible to have shar somewhat "protected" against garbage input?
    This is probably a difficult feature, and obviously it is not a nominal situation.
    But should it happen anyway,
    shar currently goes haywire, creating many random directories everywhere.

    - The "l" listing command seems broken when using pipe as input.
    Probably your readpipe correction for "x" extract is not applicable to "l".

    Rgds

  24. #24
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    v1c 01-11-2011 http://nishi.dreamhosters.com/u/shar_v1c.rar
    + "l": fix to use reads instead of seek()
    + format validation: if type is not D or F, just stop
    + utf8.inc: replace functions with winapi wrappers

    todo:
    - new utf8<->utf16 functions (speed opt, to replace winapi)
    - proper wildcard support
    - commandline parsing (with -q (quiet) as example)
    - an option to control file overwrites (overwrite/skip/ask)
    - "l": add file checks and error code

  25. #25
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    > I wonder if Windows will just feed you UTF-16 surrogate pairs instead.

    That's one of the reasons why I'm asking for examples.

    It's very tricky actually. For example, multiple encodings of the same symbols are possible -
    in other words, there's no easy way to test whether two different utf8 strings are
    the same filename or not.

    > shar currently goes haywire, creating many random directories everywhere.

    Added a check; hope that will be enough for now.
    I can additionally add a crc check for file headers, but that's extra redundancy...

    > The "l" listing command seems broken when using pipe as input.
    > Probably your readpipe correction for "x" extract is not applicable to "l".

    No, I just forgot that seek won't work for streams :)

  26. #26
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    > No. Because somebody might dump a trillion tiny files at you, you need
    > some limit on the initial block size.

    You're right, a solid index kinda implies that it will be small enough to
    fit in memory - with the index at the end we have to keep it until the end
    of compression, and with the index at the start we have to read it first and
    keep it in memory until the end of decoding.

    But we have virtual memory and x64 these days, so I still don't see much
    sense in overcomplicating things.

    To be specific, suppose we have an input thread which reads data from the
    archive and sends it to a codec. With a solid index it can be very
    simple and just read data until told to stop, or read a fixed amount known
    in advance. If index interleaving is added, it would also have to
    parse the archive format, send index data to other threads, and who knows what else.

    For example, even with a solid index I had a problem in paf, where I had
    to restore EOF coding in ppmd (basically duplicating the file size info) -
    index processing could only work sequentially (we have to create a folder
    before putting a file there), while the decompressor processed the data ahead
    of the index (why should it wait if it can work?) and as a result didn't know
    where to stop without an explicit EOF in the stream. An alternative would be to
    accumulate the data size from the already-processed part of the index
    (the index is also compressed and is decoded in parallel with the main data)
    and force the decoding thread to wait when it reaches that "partial data size"
    limit. But it's easy to make up an example where this would be suboptimal
    (the index processor gets stuck waiting for an OS response at some mkdir, while
    the decoder waits at the limit, though there's actually lots of data to decode).

    Anyway, I prefer to spend time on new functionality, rather than getting
    stuck with a bloated framework just because of some "smart" data structures.

    > The good thing about doing it my way is that it's very flexible.
    > Actually I never said that it should be
    > 1filemetadata-1filedata-1filemetadata-1filedata-... but rather
    > ~nfilesmetadata-~nfilesdata-... with n being something reasonable.

    I meant that from the start, because "1filemetadata-1filedata-" is
    what shar currently does.

    But as I said, it's trickier than it seems.
    For example, paf stores the archive index in tree form, i.e.
    no full path for each file - I tested it, and it was more
    compact than the 7z index even with three 8-byte timestamps per object.
    Now, how would you split a tree into chunks?

    Also, the n in "n files" can't be a fixed value, because both
    "nfilesmetadata" and "nfilesdata" are too variable in size.
    For example, Windows supports file paths up to 32k wchars,
    i.e. up to 96k utf8 bytes.

    > when user asks you to add some files to archive, it's just a matter
    > of appending a next piece.

    No, that's only the case with a solid index at the end.
    Otherwise you need to read the whole archive file to e.g. skip
    already-added files (by a hash).

    > Updating existing files is hard anyway, I think that it would be
    > good if archives included some patching engines to allow much faster
    > but less space efficient updates, but that's complex and breaks
    > streaming.

    I actually plan to implement something like that - e.g. caching a blockhash
    of the archive somewhere and using it for update with deduplication.
    I mean, if the blockhash file doesn't exist, we have to decode the archive
    first and rebuild it; otherwise we just append the new data.

  27. #27
    Member
    Join Date
    May 2008
    Location
    HK
    Posts
    160
    Thanks
    4
    Thanked 25 Times in 15 Posts
    Quote Originally Posted by Shelwien View Post
    > I wonder Windows will just feed you Unicode UTF-16 surrogate pairs instead.

    That's one of the reasons why I'm asking for examples.

    It's very tricky actually. For example, multiple encodings of the same symbols are possible -
    in other words, there's no easy way to test whether two different utf8 strings are
    the same filename or not.
    Here you go.
    Attached Files

  28. #28
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    A 6-byte symbol code?.. That's cool, thanks (it updates the possible max path length to 192k though).
    But I didn't quite get which version of shar you used. If it's v1c, it should be correct - it uses winapi for conversion.
    However, my XP doesn't display the name of the second file, and Google Translate doesn't translate it.

  29. #29
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    856
    Thanks
    447
    Thanked 254 Times in 103 Posts
    Thanks for the improvements, Shelwien.

    An updated version of LZ4 has been uploaded, using your new shar:
    http://nishi.dreamhosters.com/v/LZ4_....2d-alpha2.exe
    It allows displaying the contents of compressed archives (right-click on an archive, and select "view content").

    I'm also trying to provide compression level control, but I just discovered that in Windows 7 (and presumably in Windows Vista too), an application is no longer authorized to change anything in its own directory (when stored in the "Program Files" directory).
    Security control has changed compared to XP.

    I can probably hack something, but before doing something ugly, I was wondering what is considered the "proper way" to save a preference-parameters file in Windows.

  30. #30
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    > save some preference parameters

    Well, the registry.

    But if you're concerned about program updates and such, I guess you can look at Chrome -
    it's installed at C:\Documents and Settings\%USERNAME%\Local Settings\Application Data\Google\Chrome
    and also keeps its updates there.


