Not every computer has x64; for example, my cell phone doesn't, and some netbooks don't have it either.
Separating reading file format to another thread would be simpler and cleaner, though I don't like excessive use of threads. I think a task pool might work w/out this drawback.
Understandable.
How is it a problem? Order of files in the archive defines the split; if we have ...-MetadataA-FilesA-MetadataB-...
then MetadataA should contain all nodes that haven't been seen before that are required for FilesA.
There is a complication with pruning the tree when nodes become obsolete, but it doesn't seem to be a big thing.
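A minimal sketch of the "MetadataA contains all nodes that haven't been seen before" rule, assuming relative paths with '/' separators. The name new_nodes and the use of a std::set for the already-seen nodes are hypothetical, and pruning of obsolete nodes is not handled here.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Returns the directory nodes required by `files` that are not yet in `seen`,
// parents before children, and records them in `seen`.
// Assumption: paths are relative and use '/' as the separator.
std::vector<std::string> new_nodes(std::set<std::string>& seen,
                                   const std::vector<std::string>& files) {
    std::vector<std::string> out;
    for (const std::string& f : files) {
        // Every '/' in the path ends one parent-directory prefix.
        for (std::size_t pos = f.find('/'); pos != std::string::npos;
             pos = f.find('/', pos + 1)) {
            std::string dir = f.substr(0, pos);
            if (seen.insert(dir).second)  // true only the first time we see it
                out.push_back(dir);
        }
    }
    return out;
}
```

Calling it once per metadata block with the same `seen` set gives each block only the nodes that earlier blocks did not already describe.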
BTW, I just noticed that for streaming to work you need to have a limit on path length. But it's not a big deal: while most filesystems can work with arbitrary nesting, software that uses them usually can't. Linux has a limit of IIRC 4096 bytes/path, and most Windows programs 259.
Oh, I started to write a reply w/out reading your whole post and came to similar things.
Anyway this variability is why I wrote ~n and not n, I meant that number of files in a metadata block should be variable. BTW, I'm not sure if it's useful, but this doesn't have to be a natural number, you can add half of a file record to a metadata block. It might simplify interaction with an encoder, you can use hard break points when you find that oops, you gave it one record too much to fit within the limit.
Read whole metadata index. If this is a separate stream, you can skip file data, yeah it's a couple of seeks, but not that many and with n~=1000, usually just 1.
Cool.
There is %APPDATA% and %LOCALAPPDATA%, see these two wiki links for more info:
http://en.wikipedia.org/wiki/Environ...path_variables, http://en.wikipedia.org/wiki/Special...em_directories
Thanks for guidance.
I've decided to use %APPDATA% for the time being.
The new version offers a simple GUI to change the compression level.
Compression/Decompression is still achieved with context menu.
http://nishi.dreamhosters.com/v/LZ4_....2d-alpha3.exe
While at it, i've tried to see what would happen when selecting multiple files and applying "compress with LZ4" to them.
Unfortunately, instead of starting the processing once with the list of selected files, explorer launches one processing for each selected file.
Not exactly what is wanted.
I've looked around to find a solution. The question seems not to be new, but it appears no simple solution has been spelled out so far. Ideally, I would like to find one simple enough to be implemented with a batch file. Well, if that is possible....
afaik, it's impossible. i've used dll to implement this feature in freearc/issjoiner. you can reuse the dll, just use another CLSID_ShellExtension
I looked around and found 2 solutions for this:
1. http://www.codeproject.com/KB/shell/shellextguide2.aspx
http://nishi.dreamhosters.com/u/ShellExtGuide2_demo.zip
2. Don't immediately run the operation, but instead collect the filenames first.
Something like this:
Code:
- run the main app, if there's no other instance already running
- init the list somewhere (file,clipboard) with the current target
- wait 1-2s
- fetch the list of targets and process
- add current target to the list
- quit
Also this seems useful: http://www.nirsoft.net/utils/shexview.html
I guess, shar also needs a way to add multiple targets, though...
http://nishi.dreamhosters.com/u/shar_v2.rar
Code:
+ updating an already existing archive (just append)
+ adding multiple targets from command line
+ adding multiple targets from @list
+ adding multiple targets from @- (stdin)
todo:
- new utf8<->utf16 functions (speed opt, to replace winapi)
- proper wildcard support
- commandline parsing (with -q (quiet) as example)
- an option to control file overwrites (overwrite/skip/ask)
- "l": add file checks and error code
- BUG: 16-bit length for utf8 paths in archive is not enough
- always read and write only fixed aligned buffers
- directly open a pipe instead of archive file (skip cmd)
- patch an external exe into a dll, hook its file functions, run in a thread
- workaround for formats with explicit filesize
- simulate infinite input, pad input with some zero bytes for buffer flush on decomp, stop the decoding when it writes enough.
UTF-8 <--> UTF-16(WideChar) from GreenPad (MIT-like license):
It has proper handling of Unicode supplementary planes (U+10000 - U+10FFFF)
Code:
typedef unsigned long qbyte;
typedef unsigned char uchar;

static const byte mask[] = { 0, 0xff, 0x1f, 0x0f, 0x07, 0x03, 0x01 };

static inline int GetMaskIndex(uchar n)
{
    if( uchar(n+2) < 0xc2 ) return 1; // 00~10111111, fe, ff
    if( n < 0xe0 ) return 2; // 110xxxxx
    if( n < 0xf0 ) return 3; // 1110xxxx
    if( n < 0xf8 ) return 4; // 11110xxx
    if( n < 0xfc ) return 5; // 111110xx
    return 6; // 1111110x
}

static int WINAPI Utf8ToWideChar( UINT, DWORD, const char* sb, int ss, wchar_t* wb, int ws )
{
    const uchar *p = reinterpret_cast<const uchar*>(sb);
    const uchar *e = reinterpret_cast<const uchar*>(sb+ss);
    wchar_t *w = wb; // no buffer size check(spec)
    for( int t; p<e; ++w )
    {
        t = GetMaskIndex(*p);
        qbyte qch = (*p++ & mask[t]);
        while( p<e && --t )
            qch<<=6, qch|=(*p++)&0x3f;
        if(qch<0x10000)
            *w = (wchar_t)qch;
        else
            *w++ = (wchar_t)(0xD800 + (((qch-0x10000)>>10)&0x3ff)),
            *w = (wchar_t)(0xDC00 + (((qch-0x10000) )&0x3ff));
    }
    return int(w-wb);
}

void WriteLine( const wchar_t* str, ulong len )
{
    // 0000-0000-0xxx-xxxx | 0xxxxxxx
    // 0000-0xxx-xxyy-yyyy | 110xxxxx 10yyyyyy
    // xxxx-yyyy-yyzz-zzzz | 1110xxxx 10yyyyyy 10zzzzzz
    // x-xxyy-yyyy-zzzz-zzww-wwww | 11110xxx 10yyyyyy 10zzzzzz 10wwwwww
    // ...
    while( len-- )
    {
        qbyte ch = *str;
        if( (0xD800<=ch&&ch<=0xDBFF) && len )
            ch = 0x10000 + (((ch-0xD800)&0x3ff)<<10) + ((*++str-0xDC00)&0x3ff), len--;

        if( ch <= 0x7f )
            fp_.WriteC( static_cast<uchar>(ch) );
        else if( ch <= 0x7ff )
            fp_.WriteC( 0xc0 | static_cast<uchar>(ch>>6) ),
            fp_.WriteC( 0x80 | static_cast<uchar>(ch&0x3f) );
        else if( ch<= 0xffff )
            fp_.WriteC( 0xe0 | static_cast<uchar>(ch>>12) ),
            fp_.WriteC( 0x80 | static_cast<uchar>((ch>>6)&0x3f) ),
            fp_.WriteC( 0x80 | static_cast<uchar>(ch&0x3f) );
        else if( ch<= 0x1fffff )
            fp_.WriteC( 0xf0 | static_cast<uchar>(ch>>18) ),
            fp_.WriteC( 0x80 | static_cast<uchar>((ch>>12)&0x3f) ),
            fp_.WriteC( 0x80 | static_cast<uchar>((ch>>6)&0x3f) ),
            fp_.WriteC( 0x80 | static_cast<uchar>(ch&0x3f) );
        else if( ch<= 0x3ffffff )
            fp_.WriteC( 0xf8 | static_cast<uchar>(ch>>24) ),
            fp_.WriteC( 0x80 | static_cast<uchar>((ch>>18)&0x3f) ),
            fp_.WriteC( 0x80 | static_cast<uchar>((ch>>12)&0x3f) ),
            fp_.WriteC( 0x80 | static_cast<uchar>((ch>>6)&0x3f) ),
            fp_.WriteC( 0x80 | static_cast<uchar>(ch&0x3f) );
        else
            fp_.WriteC( 0xfc | static_cast<uchar>(ch>>30) ),
            fp_.WriteC( 0x80 | static_cast<uchar>((ch>>24)&0x3f) ),
            fp_.WriteC( 0x80 | static_cast<uchar>((ch>>18)&0x3f) ),
            fp_.WriteC( 0x80 | static_cast<uchar>((ch>>12)&0x3f) ),
            fp_.WriteC( 0x80 | static_cast<uchar>((ch>>6)&0x3f) ),
            fp_.WriteC( 0x80 | static_cast<uchar>(ch&0x3f) );
        ++str;
    }
}
Last edited by roytam1; 3rd November 2011 at 17:14.
btw, afair, utf8 is able to encode integers up to 2^31, using 6 bytes. but Unicode itself uses only a bit more than 2^20 code points, and those are encoded in 4 bytes max
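To make the byte counts concrete, here is a small sketch of the length rule for the original 6-byte UTF-8 scheme, which covers values below 2^31. The function name utf8_length is hypothetical; the thresholds are the same ones the WriteLine code uses (0x7f, 0x7ff, 0xffff, 0x1fffff, 0x3ffffff).

```cpp
#include <cstdint>

// Number of bytes the original (up to 6-byte) UTF-8 scheme needs for `cp`.
// Returns 0 for values that do not fit in 31 bits.
int utf8_length(std::uint32_t cp) {
    if (cp < 0x80)        return 1;  // 7 bits
    if (cp < 0x800)       return 2;  // 11 bits
    if (cp < 0x10000)     return 3;  // 16 bits
    if (cp < 0x200000)    return 4;  // 21 bits, enough for all of Unicode
    if (cp < 0x4000000)   return 5;  // 26 bits
    if (cp < 0x80000000u) return 6;  // 31 bits
    return 0;
}
```

Since the highest Unicode code point is U+10FFFF, utf8_length never returns more than 4 for valid Unicode input.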
- proper wildcard support
Code:
match(p,s)
  switch *p
    case '?': *s && match(p+1,s+1)
    case '*': match(p+1,s) || (*s && match(p,s+1))
    case 0: *s==0
    default: *p==*s && match(p+1,s+1)
for improved speed, you can compile "*.ext" to "@txe.", matching from the end
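The recursive pseudocode can be written in C++ more or less literally. This is only a sketch for plain char strings; match_suffix is a hypothetical helper illustrating the "match from the end" idea for patterns of the form "*literal", not the actual "@txe." compiled form.

```cpp
#include <algorithm>
#include <string>

// Recursive wildcard match: '?' matches one character, '*' matches any run.
bool match(const char* p, const char* s) {
    switch (*p) {
    case '?':  return *s && match(p + 1, s + 1);
    case '*':  return match(p + 1, s) || (*s && match(p, s + 1));
    case '\0': return *s == '\0';
    default:   return *p == *s && match(p + 1, s + 1);
    }
}

// Fast path for "*literal": compare the literal against the end of the name.
bool match_suffix(const std::string& literal, const std::string& name) {
    return literal.size() <= name.size() &&
           std::equal(literal.rbegin(), literal.rend(), name.rbegin());
}
```

Note that the '*' case can backtrack heavily on patterns with many stars, so the recursion depth and running time grow with the length of the name.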
@roytam1:
Thanks, but that also is only good as a reference... Bulat's version is still nicer than that.
But I want a streamable implementation - like "read a symbol; think a little; maybe write a few symbols" -
without lookaheads, tables, or other weirdness.
@Bulat:
Thanks, I kinda forgot that recursion is also a method.
But in this case I'd like to generate some state machine instead.
I mean, recursive regexps on unicode filenames with lengths up to 32k symbols?
Also I want to make an advanced filter - with multiple inclusion and exclusions masks.
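One way to get a streamable matcher without recursion is to simulate the pattern as a small state machine and keep the set of active states in a bitmask, updating it once per input symbol. The sketch below is only an illustration under stated assumptions: the names WildcardMatcher, wild_match and selected are hypothetical, patterns are limited to 63 characters, and it works on char rather than on unicode symbols.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Bit i set in states_ means: the first i pattern characters have matched
// the input consumed so far. Memory use does not depend on the name length.
class WildcardMatcher {
public:
    explicit WildcardMatcher(const std::string& pattern) : pat_(pattern) {
        if (pat_.size() > 63) throw std::length_error("pattern too long");
        reset();
    }
    void reset() { states_ = closure(1); }
    // Consume one input symbol and update the set of active states.
    void feed(char c) {
        std::uint64_t next = 0;
        for (std::size_t i = 0; i < pat_.size(); ++i) {
            if (!((states_ >> i) & 1)) continue;
            if (pat_[i] == '*')
                next |= std::uint64_t(1) << i;        // '*' absorbs c
            else if (pat_[i] == '?' || pat_[i] == c)
                next |= std::uint64_t(1) << (i + 1);  // advance by one
        }
        states_ = closure(next);
    }
    bool accepted() const { return (states_ >> pat_.size()) & 1; }
private:
    // '*' may match nothing: an active state at a '*' also activates the next.
    std::uint64_t closure(std::uint64_t s) const {
        for (std::size_t i = 0; i < pat_.size(); ++i)
            if (((s >> i) & 1) && pat_[i] == '*') s |= std::uint64_t(1) << (i + 1);
        return s;
    }
    std::string pat_;
    std::uint64_t states_;
};

bool wild_match(const std::string& pattern, const std::string& name) {
    WildcardMatcher m(pattern);
    for (char c : name) m.feed(c);
    return m.accepted();
}

// Filter with multiple masks: some include mask matches, no exclude mask does.
bool selected(const std::vector<std::string>& include,
              const std::vector<std::string>& exclude,
              const std::string& name) {
    for (const std::string& x : exclude)
        if (wild_match(x, name)) return false;
    for (const std::string& i : include)
        if (wild_match(i, name)) return true;
    return false;
}
```

Each symbol costs one pass over the pattern, so a 32k-symbol name needs no recursion and no lookahead, only the 64-bit state word.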
I do - shar uses windows apis for wildcard support and utf8 atm.
Here is the latest package, using Shelwien's newest Shar version :
http://nishi.dreamhosters.com/v/LZ4_....2d-alpha4.exe
The only visible difference is that Folder can now also be selected from within the Folder tab.
Other than that, there are minor adaptations in the background and installation scripts.
I initially planned to release a new GUI program as part of alpha4, as a first step towards multi-selection,
but i completely failed this development, and since my programming time is very limited these days, i had to scale back ambitions.
Anyway, feel free to comment
Actually I wanted to write an additional utility for collection and merging of single arguments from multiple exe runs into a list.
Something like
1. Store the argument (filename)
2. If there're no already running instances, wait for specified time (0.1s or maybe 0.5s)
3. If another instance is already running, send the argument to it and quit
4. return the list in some environment variable
Would you use that, or is there no need to bother?
Yes, indeed i would use it.
I'm slightly afraid of the reliability of a method based on dead-activity timers though.
For this method to be successful, the dead-activity timer should remain within the 0.1s range (unobtrusive). At this level, there is a risk that any unwanted delay (CPU fully used, HDD turned on, etc.) may break the chain.
But maybe this risk is over-rated, and does not really exist.
With proper testing, maybe it can be ensured that there are no "weird" situations in which the list is split into 2 or more parts.
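The splitting risk can be illustrated without any OS calls. The sketch below assumes a timer that restarts on every arrival, which is only one possible reading of the proposal; split_batches and its parameters are hypothetical names. It groups arrival times into batches and starts a new batch whenever the gap exceeds the timeout.

```cpp
#include <cstddef>
#include <vector>

// Groups ascending arrival times (in ms) into batches. A new batch starts
// when the gap since the previous arrival exceeds idle_timeout_ms, which
// models the collector giving up and running the script too early.
std::vector<std::vector<long>> split_batches(const std::vector<long>& arrivals_ms,
                                             long idle_timeout_ms) {
    std::vector<std::vector<long>> batches;
    for (std::size_t i = 0; i < arrivals_ms.size(); ++i) {
        if (i == 0 || arrivals_ms[i] - arrivals_ms[i - 1] > idle_timeout_ms)
            batches.emplace_back();
        batches.back().push_back(arrivals_ms[i]);
    }
    return batches;
}
```

With a 100 ms timeout, arrivals at 0, 20, 40 and 60 ms stay in one list, while a single 230 ms stall between the second and third file yields two lists.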
As a side comment :
I tried to compile the Shar source code, using Visual Studio Express 2008.
It works.
But the generated binary is 14K, while the one you provide in the package is 8K.
Is there any specific configuration to apply to reach such a result, or is it only reachable with VC6 ?
My idea is that another argument of this utility would be the path of the script to run after collecting filenames.
So depending on implementation, it would either skip names added after running the script, or run it again later
(which would append the archive in shar case).
So I think it should be safe enough either way.
However, I don't quite understand where to acquire the archive name.
Please find in the below link a proposed modification for Shelwien's shar.
http://nishi.dreamhosters.com/v/shar_v2_01y.zip
The modification is very minor : it only affects the way filenames are output during concatenation process.
It is accessible using "as" command. The normal output is still present, using the standard "a" command.
In a stand-alone environment, the idea of using a single line for filename output might look strange. But in combination with LZ4, it lets the output of both programs coexist on a single line, resulting in more readable information for the user. The result can be observed in the alpha5 version :
http://nishi.dreamhosters.com/v/LZ4_....2d-alpha5.exe
To do :
- I would also like to get rid of the last message, "no more file left", which seems to be output by the system. I've not found a way to control this.
- Apply the same logic for decompression
Notes :
- It is now visible that the "concatenation process" tends to buffer input data before flushing it to the compressor through the pipe : the compressor tends to remain at zero for a long time before abruptly outputting several tens of megabytes.
I don't know if this is intentional or a side effect.
- The proposed binary is compiled using /MT, which means there is no dependency on an external runtime library. As a consequence, it is bigger (58K, as opposed to only 8K for Shelwien's version). Nevertheless, even when linking to the runtime library, it is still 14K, which is bigger than Shelwien's version. This last difference cannot be a consequence of the alternative output. So what could be the reason ?
> http://nishi.dreamhosters.com/v/shar_v2_01y.zip
I looked at it, and didn't like the printf("\b") loop -
imho it can be pretty slow, can't you make a string of 56 \b
and print it at once, or, better, use \r?
Also what's the point in 54.54, won't %-54s do the same?
Also you might need a fflush there, though maybe it would
be ok because of stderr redirection, but normally gcc caches
the printf output and only actually prints anything after \n.
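A small sketch of the formatting points raised here, with hypothetical helper names and a 54-column field. It shows that "%-54.54s" both pads and truncates while "%-54s" only pads, and how a single string of backspaces plus an explicit flush could replace a per-character printf("\b") loop. This is an illustration, not shar's actual code.

```cpp
#include <cstdio>
#include <string>

// "%-54.54s": left-justified, padded to 54 columns AND cut at 54 characters.
std::string fixed_field(const std::string& name) {
    char buf[256];
    std::snprintf(buf, sizeof buf, "%-54.54s", name.c_str());
    return buf;
}

// "%-54s": padded to 54 columns, but a longer name is printed in full
// (here limited only by the 255-character scratch buffer).
std::string padded_field(const std::string& name) {
    char buf[256];
    std::snprintf(buf, sizeof buf, "%-54s", name.c_str());
    return buf;
}

// Prints the field, then moves the cursor back to its start with one write.
void print_status(std::FILE* out, const std::string& name) {
    static const std::string back(54, '\b');
    std::fprintf(out, "%-54.54s%s", name.c_str(), back.c_str());
    std::fflush(out);  // make sure the text appears before the next '\n'
}
```

The difference only shows for names longer than the field: a 60-character name gives 54 characters with the first format and 60 with the second, which would push the rest of the status line out of place.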
> - I would like to also get rid of the latest message "no more file
> left" which seems to be output by the system. I've not found a way
> to control this.
The line printf( "Result: %s\n", GetErrorText() ); in shar.cpp does that.
In shar I've tried to work around localization by making use of
(already localized) OS.
Although atm as an example, there's only SetLastError(ERROR_BAD_FORMAT)
used when shar x/l encounters wrong archive syntax.
But you can add SetLastError(0) to get a success message, or simply disable
that Result: line for now.
> - It is now visible that the "concatenation process" tends to buffer
> input data before flushing it to the compressor through the pipe :
> the compressor tends to remain at zero for a long time before
> abruptly outputting several tens of megabytes.
Shar itself certainly doesn't buffer anything, and I didn't observe
such large buffers in OS pipes either.
Can't that be a delay on LZ4 side? Maybe it waits until some large input buffer is filled?
> - The binary version proposed is compiled using /MT, which means
> there is no dependency on external runtime library.
In my version there's dependency only on msvcrt.dll which is
available everywhere.
> Nevertheless, even with linking to runtime library, it is still 14K,
> which is bigger than Shelwien's version. This last difference
> cannot be the consequence of the alternative output.
> So what could be the reason ?
I'm using VC6 library where I can.
It lacks some things (like fopen_s and MT locks in RTL functions),
but otherwise it's much less bloated... and compatible with msvcrt.dll.
You can find the library eg. there - http://nishi.dreamhosters.com/IC_11-1-65.rar
It can be used with newer VC too (as I do).
Also the .exe is 11-12k using http://tdm-gcc.tdragon.net/
> didn't like the printf("\b") loop - imho it can be pretty slow, can't you make a string of 56 \b and print it at once
OK, good idea, this is corrected.
> better, use \r
It doesn't work.
The reason is : the first part of the status line is for LZ4, while the second part is for Shar.
With a "\r", the cursor gets back to the beginning of the line.
Then, when Shar writes something, it just erases the LZ4 message. I've not found a way to "skip" the first part of the line without erasing it in the process.
> The line printf( "Result: %s\n", GetErrorText() ); in shar.cpp does that.
Thanks, this is corrected.
The message is displayed in "normal" mode, and not displayed in "single line" mode.
> Shar itself certainly doesn't buffer anything, and I didn't observe such large buffers in OS pipes either. Can't that be a delay on LZ4 side? Maybe it waits until some large input buffer is filled?
I also suspected the system to have a role in this behavior, and indeed it has.
Compression startup delay can be attributed to LZ4 up to 8MB. Beyond that point, something else is at stake.
And indeed, what can be witnessed is that the "compression counter" would remain at zero for a while, and then abruptly deliver 30-50MB in a snap. Like a floodgate.
I've tried to play with program priorities, to ensure that the consumer program gets more priority than the data generator. It helps a bit, the behavior is partly improved, but the effect still remains visible.
> I'm using VC6 library where I can.
What should be done to use this library within a VS project ? I don't know how to do that.
Apart from this last library question, all your comments have been taken into consideration in the newest shar modification proposed below, which also extends the "single line mode" to stream decoding :
http://nishi.dreamhosters.com/v/shar_v2_02y.zip
The combination of this new version of shar with LZ4 produces the alpha6 installer below :
http://nishi.dreamhosters.com/v/LZ4_....2d-alpha6.exe
I'm starting to like the output of this one.
> I've not found a way to "skip" the first part of the line without
> erasing it in the process.
Well, ok then, though it's certainly possible via console output winapi.
> I also suspected the system to have a role in this behavior, and indeed it has.
That's weird... there're topics like http://stackoverflow.com/questions/1...fering-in-pipe
But it talks about a completely different buffer size - like 4k,
which is what I'd expect it to be too.
And I certainly don't observe this here on XP.
Can you confirm it with a simple read-from-stdin-and-count script?
Something like this - http://nishi.dreamhosters.com/u/incount_0.rar
For me, it is
Code:
> shar.exe a - C:\Intel* 2>err | incount.exe
count=10
count=21
count=50
count=545
count=1041
count=1536
count=2032
count=2528
[...]
ie no delay at all. Practically zero.
> What should be done to use this library within a VS project ? I don't know how to do that.
Ok, you can use this - http://nishi.dreamhosters.com/v/MSC_14-00-50110.rar
Normally it's done by configuring includes/libs in VS project properties, but I don't like VS, so...
Also it's a bother to collect all the _CRTxxx macros to define for newer VS versions.
> I certainly don't observe this here on XP.
I made some tests this morning with Windows XP, and i observed no delay either.
So i guess the "system buffering" effect is only visible with Windows Seven (maybe Vista).
1. I also found this -
http://stackoverflow.com/questions/3...pe-win32-api-c
2. And this -
http://msdn.microsoft.com/en-us/libr...(v=vs.85).aspx
I guess you can try adding FlushFileBuffers to shar and/or lz4 and test it.
3. Tested incount on 2008R2. After replacing rdtsc with GetTickCount, it printed 1.5M as the first result.
So there's really some OS buffering.
4. [3] is actually wrong, because when I used the intrinsic __rdtsc instead of my version
(which lacked the executable attribute on the data representing the rdtsc code)
incount actually started printing small counts again.
So looks like it is possible to get bytes from a pipe without delay on win7.
Incount reuploaded - now that should work on win7 - http://nishi.dreamhosters.com/u/incount_0.rar
I tested incount with Windows Seven, and it shows no delay.
But i'm not sure if it is representative. getc() may behave differently from ReadFile().
Anyway, the "system buffering" effect is not such a big deal, after all any potential delay is dwarfed by HDD seek time.
Here is a proposed new modification of shar.
This time, it tries to modify listing output.
http://nishi.dreamhosters.com/v/shar_v2_03y.zip
As usual, the normal output remains untouched, the alternate output is accessible through an extra command, in this case "la".
I've initially taken the assumption that a directory would be followed by files in the same directory, but that's not completely true.
For example, we can get this sequence :
Dir
Dir/Subdir
Dir/Subdir/FileA
Dir/Subdir/FileB
Dir/FileC
This obviously affects the presentation of the results, since i can no longer assume that all files after a directory are in that directory. As a consequence, i've taken the easy fix of outputting the full path. It's correct, but not pleasant to read.
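One possible alternative to printing the full path, sketched here only as an idea and not as what the "la" command does: indent each entry by its depth and print just the last path component. The function name tree_line is hypothetical, and the sketch assumes '/' separators and that every directory is listed before its contents.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Formats one archive entry as an indented tree line: two spaces per
// directory level, followed by the last path component.
std::string tree_line(const std::string& path) {
    std::size_t depth =
        static_cast<std::size_t>(std::count(path.begin(), path.end(), '/'));
    std::size_t cut = path.rfind('/');
    std::string leaf = (cut == std::string::npos) ? path : path.substr(cut + 1);
    return std::string(2 * depth, ' ') + leaf;
}
```

For the sequence above, FileA and FileB would appear indented under Subdir and FileC one level up under Dir, since the indentation depends only on the entry's own path and not on which directory was printed last.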
The listing also shows speed limitations in pipe mode. This is understandable : since it is no longer possible to "skip" data, because pipe is non-seekable, all data between 2 entries must be read. So if a file is 250MB long, 250MB must be read before reaching the next entry. It works, it's just a bit slow.
But here i see no solution with pipe mode.
This new version is combined with LZ4 in alpha7 :
http://nishi.dreamhosters.com/v/LZ4_....2d-alpha7.exe
Alpha7 also introduces a nifty feature, unfortunately only available on Windows Seven : the "compress with LZ4" entry no longer appears in the context menu when the selected file is already compressed.
Btw, now that you finally started modifying shar, what's the point of keeping compatibility with my version?
I mean, why not integrate LZ4 directly into it, and only keeping your version of output etc?
Hi Shelwien
I believe you are right.
However, integration is a lot of work, and therefore requires a lot of time.
The pipe strategy was designed as a quick fix rather than a final solution.
As always, a "quick fix" can last quite a long time
So i agree that in the long term, integration is most probably the better solution.
Since we are talking pipes here, can someone point me to the source of gclip.exe, which is a Win32 PE executable from http://unxutils.sourceforge.net/
I find that the gclip executable takes stdin and stuffs the text into the clipboard. I use it like this: base64.exe -d TextwithMIME.txt | zpipe.exe | gclip.exe
Now my base64-encoded ZPAQ data is uncompressed text in the clipboard. I downloaded http://sourceforge.net/projects/unxu...c.zip/download
but the source was not there. Does anyone have gclip.c or some other ANSI C program that does this? *ahem* perhaps Shelwien ?
I don't quite get what you say, but http://nishi.dreamhosters.com/u/clip_v0.rar
Thanks Shelwien !