Activity Stream

  • Shelwien's Avatar
    Today, 02:22
    > Did you mean count how many files can be compressed to the same compressed sequence?
    It's obvious that it's impossible to decompress different data from the same "compressed sequence".
    > As far as repetition, I meant patterns or frequencies.
    It's possible to replace the entropy coding with hashing, but in the end, "patterns or frequencies" are the only sources of compressible redundancy.
    https://en.wikipedia.org/wiki/Kolmogorov_complexity
    3 replies | 99 view(s)
  • Bundle's Avatar
    Today, 01:42
    Thanks! I'll check out the binary diff tool. It will save me loads of time trying to find the error. As far as repetition, I meant patterns or frequencies. My algorithm uses arithmetic but no probabilities or anything. Did you mean count how many files can be compressed to the same compressed sequence?
    3 replies | 99 view(s)
  • Shelwien's Avatar
    Today, 01:25
    > so it's close, but not 100%.
    It's not a problem - you can always just use a binary diff tool: https://github.com/sisong/HDiffPatch/releases (diff the correct file with the partially decoded file, then count the diff size as part of the compressed size). Or, alternatively, ECC: https://en.wikipedia.org/wiki/Parchive
    > are there any current compression algorithms that don't rely on repetition or patterns?
    Depends on what you mean by "repetitions". Entropy coding algorithms don't rely on repeated long strings, but they do rely on statistics, i.e. patterns in symbol repetitions.
    > break some rules of compression.
    It's not about "rules". It's about the ability to count the numbers of possible input files and compressed files.
    3 replies | 99 view(s)
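    To make the counting argument above concrete, here is a minimal sketch (an editorial example, not from the thread; the 1024-byte/20-byte figures come from the claim discussed in this thread):
    #include <stdio.h>
    int main(void) {
        // A decoder maps each compressed file back to at most one input,
        // so 20-byte outputs can represent at most 2^160 of the 2^8192
        // possible 1024-byte inputs.
        int in_bits  = 1024 * 8;   // 8192
        int out_bits = 20 * 8;     // 160
        printf("distinct inputs : 2^%d\n", in_bits);
        printf("distinct outputs: 2^%d\n", out_bits);
        printf("recoverable fraction: 2^-%d\n", in_bits - out_bits);
        return 0;
    }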
  • Bundle's Avatar
    Today, 01:09
    So, the compression algorithm I've been working on doesn't require ANY repetition to be effective. Currently, it's crunching runs of 1,024 bytes down to a 20-byte representation, and that's really squeezing everything out of it. I could stand to raise the representation up to even 512 bytes and still be getting great performance out of it. On a 1GB "random data" file, it compressed at 549 Mbps in 18.7 seconds at 98.24% compression. The compressed file is ~18.5kb now. The decompressor is already coded, but I am working out bugs. I lose a few bytes on decompression, so it's close, but not 100%. I will say in advance, I understand any doubt. I think it's healthy to be skeptical, especially considering this would break some rules of compression. All this to ask, are there any current compression algorithms that don't rely on repetition or patterns?
    3 replies | 99 view(s)
  • Sportman's Avatar
    Yesterday, 23:19
    To have a "true" random file for compression tests.
    53 replies | 2071 view(s)
  • Sportman's Avatar
    Yesterday, 23:17
    241,206 bytes - binary version bcrnd.txt
    241,444 bytes paq8pxd
    241,423 bytes rz
    241,344 bytes 7zip
    241,335 bytes paq8px
    241,279 bytes rar
    241,277 bytes nz
    241,274 bytes gzip
    241,253 bytes cmix
    53 replies | 2071 view(s)
  • Sportman's Avatar
    Yesterday, 22:20
    Block time is every 10 minutes (600 seconds) on average (ranging between a few seconds and a few hours), so I expected more last digits close to 0, but as far as I know there is no (atomic/GPS) time sync in the network or in the miners' hardware.
    53 replies | 2071 view(s)
  • xinix's Avatar
    Yesterday, 22:12
    I do not know why it is needed. But nelson_dec2bin_v0, which Shelwien made (https://encode.ru/threads/3122-dec2bin-converter-for-Nelson-s-million-digits-file?highlight=nelson_dec2bin_v0), compresses this file to 241,206 bytes:
    241,486 bytes paq8px
    241,426 bytes cmix
    241,206 bytes nelson_dec2bin_v0
    53 replies | 2071 view(s)
  • Sportman's Avatar
    Yesterday, 22:02
    I had an idea for how to create a random file: let the Bitcoin core "miners' rat race" block time decide the values. I used all blocks from the start till today; no idea if it's really random.
    580,880 bytes - block time last digit from 580,880 Bitcoin core blocks
    273,794 bytes gzip
    273,053 bytes rar
    260,849 bytes 7zip
    243,518 bytes rz
    241,975 bytes nz
    241,586 bytes paq8pxd
    241,486 bytes paq8px
    241,426 bytes cmix
    53 replies | 2071 view(s)
  • fhanau's Avatar
    Yesterday, 16:26
    Ok, I have opened an issue for this on the GitHub page so I don't forget about it. Unfortunately, I am not currently doing any active development on ECT so I don't know when I'll get to fix it.
    398 replies | 100578 view(s)
  • Sportman's Avatar
    Yesterday, 10:36
    Truly random is difficult: https://www.yubico.com/support/security-advisories/ysa-2019-02/ https://www.yubico.com/replaceorder/
    7 replies | 354 view(s)
  • moisesmcardona's Avatar
    14th June 2019, 21:37
    $0.13/KWh here :)
    8 replies | 350 view(s)
  • JamesB's Avatar
    14th June 2019, 14:35
    I recall that on one machine, a loop doing "rand()&1" to get random 0/1 bits just alternated 0, 1, 0, 1, ... haha! Fortunately, most machines have long since moved to better random number generators. It's quite shocking if Visual Studio is still so bad.
    7 replies | 354 view(s)
  • Shelwien's Avatar
    14th June 2019, 12:15
    > rnd = 29943829*rnd + 1013904223
    In this case the low 4 bits have a period of 16 :) But with an extra shift, it's good:
    unsigned rnd = 1;
    unsigned rng() {
      rnd = 29943829*rnd + 1013904223;
      return rnd>>24;
    }
    1,000,000 1.hex
      500,000 1.bin
      500,148 1.7z
      136,563 1.bin.paq8px179fix3
    1,000,000 2.hex
      500,000 2.bin
      500,148 2.7z
      500,290 2.bin.paq8px179fix3
    7 replies | 354 view(s)
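    A small test sketch (not from the post) of why the extra shift matters: for an LCG with modulus 2^32, the low k bits have period at most 2^k, so the lowest 4 bits cycle with period 16 while the high bits from rnd>>24 do not:
    #include <stdio.h>
    int main(void) {
        unsigned rnd = 1;
        // Print the low 4 bits of consecutive LCG states;
        // the output repeats with period 16.
        for (int i = 0; i < 32; i++) {
            rnd = 29943829*rnd + 1013904223;
            printf("%u ", rnd & 15);
        }
        printf("\n");
        return 0;
    }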
  • schnaader's Avatar
    14th June 2019, 10:39
    "rand() % N" is known to be bad, see this: https://stackoverflow.com/a/49880109/34065 Basically, if (RAND_MAX+1) % N != 0, you get skewed results. Though in this case it could indeed be the bad random generator, since N==16 should give good results (and the paq8px result points to a bigger skew).
    7 replies | 354 view(s)
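    A sketch checking the modulo-bias point for this case (assuming MSVC, where RAND_MAX is 32767, so (RAND_MAX+1) % 16 == 0 and rand() % 16 adds no bias of its own - any skew must come from rand() itself):
    #include <stdio.h>
    #include <stdlib.h>
    int main(void) {
        long counts[16] = {0};
        // Histogram of rand() % 16 over 1.6M draws; expect ~100000 per bucket.
        for (long i = 0; i < 1600000; i++)
            counts[rand() % 16]++;
        for (int n = 0; n < 16; n++)
            printf("%2d: %ld\n", n, counts[n]);
        return 0;
    }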
  • Bulat Ziganshin's Avatar
    14th June 2019, 03:32
    I can't believe that it's so predictable. What about an LCG:
    rnd = 29943829*rnd + 1013904223; // https://en.wikipedia.org/wiki/Linear_congruential_generator
    and modern ones:
    1. http://xoshiro.di.unimi.it/
    2. http://www.pcg-random.org/
    7 replies | 354 view(s)
  • load's Avatar
    14th June 2019, 01:19
    more or less a mirror of ftp.elf.stuba.sk/pub/pc/pack/ ?
    1 replies | 197 view(s)
  • CompressMaster's Avatar
    13th June 2019, 20:04
    Here I found a lot of interesting compressors. Hope it is useful. Link
    CompressMaster
    1 replies | 197 view(s)
  • Shelwien's Avatar
    13th June 2019, 19:27
    Yes :) I tested it with cdm and paq and it doesn't seem compressible. Also, I guess that online service uses some crypto-random... I just tried writing such a generator using the VS default PRNG:
    char b2h[] = "0123456789ABCDEF";
    for( i=0; i<N; i++ ) putc( b2h[rand()%16], g );
    and paq8px compressed it to 136k :)
    7 replies | 354 view(s)
  • SpyFX's Avatar
    13th June 2019, 19:22
    hex2bin 500000 50% .... it's my first bid, lol
    7 replies | 354 view(s)
  • CompressMaster's Avatar
    13th June 2019, 19:10
    Here's a hexadecimal compression challenge. It was generated using an online random generator. The challenge is simple - losslessly compress it to the smallest possible size.
    1000000 0% hexadecimal data.txt
    CompressMaster
    7 replies | 354 view(s)
  • Shelwien's Avatar
    13th June 2019, 18:55
    > Q1. 1<<10..1<<16 how it is derived?
    I just know it from experience:
    768,771 original size
    427,800 SCALElog=8
    379,664 SCALElog=9
    362,129 SCALElog=10
    355,447 SCALElog=11
    352,591 SCALElog=12
    351,322 SCALElog=13
    350,746 SCALElog=14
    350,483 SCALElog=15
    I mean, it's obvious that higher precision is better for prediction quality. But I only know the actual impact from empirical testing - it depends on the sample data, so it's hard to derive mathematically.
    > Is there any dependency on the number of possible symbols or range?
    With a larger alphabet it can be possible to use lower precision. For example, ppmd works pretty well with 6-bit counters and a byte alphabet. Anyway, if you only care about compression, it's always better to use higher precision. I can observe some gains even with 64-bit counters vs 32-bit. But most codecs are designed with speed optimization as the priority, and that puts a hard limit on possible precision.
    In fact, we can interpret a counter as a lossily compressed representation of the context history, so higher counter precision means more information about the context is preserved. But, for example, more than 16 bits of probability precision requires 2x the memory (the cpu can only work directly with 8/16/32/64-bit types), and a 17x17-bit product won't fit in a 32-bit variable. Both of these facts are bad for coding speed.
    There's also a relation between probability precision and the rangecoder, yes.
    1) range>maxprob has to be always true, otherwise the range value could become 0 and there'd be a decoding error at that point.
    2) The rc should be able to shift the range up by >= the number of probability bits, otherwise (1) is impossible. In unoptimized code we can just use a range normalization loop, but it's certainly faster to put 1-2 ifs there instead. LZMA only has 1 if for range renorm, since it knows that range can only be reduced by (11-log2i(31))=7 bits per step, so a single 8-bit shift can compensate for that.
    > SCALE/minprob < (1<<8)
    > Q2. I could not understand the idea behind this equation. Practically, what does it mean?
    After (de)coding a symbol, the new rc range is range*p/SCALE, so in the worst case the range is reduced by a factor of SCALE/minprob. And there's only one renorm iteration, which can add 8 bits to range. From that we can see the minimum possible probability which won't cause rc underflow.
    > got the results as shown in attached image NoOfBitsVsStepsTo31Prob.PNG.
    > Does this observation have anything to do with the choice of probability model that we are using?
    I don't think so. (p>>5) becomes 0 for p<32, so p-=(p>>5) can't produce a p value lower than 31. Which in turn affects rc precision, but there's no magic in this specific value. Just as well, for some data types, compression may be better with p-=(p>>4) or p-=(p>>6). In fact, I just use counters with multiplication instead, like p-=(p*C)/SCALE, and optimize the C value for sample data.
    > To observe the difference we tried plotting the probability graph and could not see the difference they are talking about, as the graph seems to be similar in both cases (saturation occurs in both cases).
    There is a difference. Try testing with nonstationary data, e.g. BWT output: http://nishi.dreamhosters.com/u/BWT16.rar
    A counter like p = SCALE*n0/(n0+n1); n[bit]++; would converge to a fixed value at some point (and also would have problems with n overflows when there's an infinite volume of data). While a counter like p = p*C + bit*(1-C) (with p scaled to 0..1) can adapt to recently processed data.
    33 replies | 30695 view(s)
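    A runnable sketch of the shift-based counter discussed above, showing the floor at 31 (SCALE and the run length are illustrative):
    #include <stdio.h>
    int main(void) {
        // p estimates P(bit==0), scaled to SCALE = 1<<11 (LZMA-style).
        // On bit==1: p -= p>>5. Once p < 32, p>>5 == 0, so p never
        // drops below 31.
        unsigned p = 1u << 10;            // start at SCALE/2
        for (int i = 0; i < 1000; i++)
            p -= p >> 5;                  // feed a long run of 1-bits
        printf("floor: %u\n", p);         // prints 31
        return 0;
    }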
  • Sportman's Avatar
    13th June 2019, 18:40
    Sportman replied to a thread Crypto-Mining in Data Compression
    Indeed, Malta, Cyprus and Luxembourg generate the lowest <2.5um and <10um particulates from the energy industry:
    http://appsso.eurostat.ec.europa.eu/nui/show.do?dataset=env_ac_aibrid_r2&lang=en
    http://appsso.eurostat.ec.europa.eu/nui/show.do?dataset=env_ac_ainah_r2&lang=en
    http://appsso.eurostat.ec.europa.eu/nui/show.do?dataset=env_air_emis&lang=en
    http://appsso.eurostat.ec.europa.eu/nui/show.do?dataset=env_air_gge&lang=en
    http://appsso.eurostat.ec.europa.eu/nui/show.do?dataset=env_ac_io10&lang=en
    http://appsso.eurostat.ec.europa.eu/nui/show.do?dataset=env_ac_aeint_r2&lang=en
    https://ec.europa.eu/eurostat/tgm/table.do?tab=table&init=1&language=en&pcode=sdg_13_10&plugin=1
    https://ec.europa.eu/eurostat/tgm/table.do?tab=table&init=1&language=en&pcode=t2020_rd300&plugin=1
    https://ec.europa.eu/eurostat/tgm/table.do?tab=table&init=1&language=en&pcode=t2020_30&plugin=1
    https://ec.europa.eu/eurostat/tgm/table.do?tab=table&init=1&language=en&pcode=t2020_35&plugin=1
    https://ec.europa.eu/eurostat/tgm/table.do?tab=table&init=1&language=en&pcode=sdg_13_20&plugin=1
    8 replies | 350 view(s)
  • Hariom's Avatar
    13th June 2019, 17:28
    Hello Shelwien, thanks for your valuable reply.
    > practical probability scale is within 1<<10..1<<16 -
    > less than 10 bits is not enough to hold data statistics (worse compression),
    > more than 16 bits is much slower to handle, but doesn't improve compression anymore.
    Q1. How is the 1<<10..1<<16 range derived? Is there any dependency on the number of possible symbols or range?
    > SCALE/minprob < (1<<8)
    Q2. I could not understand the idea behind this equation. Practically, what does it mean?
    Q3. I read the article https://fgiesen.wordpress.com/2015/0...hmetic-coding/ and performed some calculations with respect to the number of bits in the probability model, and got the results shown in the attached image NoOfBitsVsStepsTo31Prob.PNG. Does this observation have anything to do with the choice of probability model that we are using? The attached results were obtained by running the code below:
    #include <stdio.h>
    int main() {
        /* The multiplication factor for 1024 can be increased
           to get a higher-precision probability model */
        int pb_0 = 1024 * 1;
        while (pb_0 > 31) {
            pb_0 -= pb_0 >> 5;    /* decay toward the floor of 31 */
            printf("%d\n", pb_0);
        }
        return 0;
    }
    Q4. In the article at https://fgiesen.wordpress.com/2015/0...hmetic-coding/ we found the following observations:
    1. For the first approach, an adaptive binary model whose counts increase with the number of occurrences of the symbol: "This is the obvious construction. The problem with this is that probabilities move quickly after initialisation, but very soon start to calcify; after a few thousand symbols, any individual observation barely changes the probabilities at all. This sucks for heterogeneous data, or just data where the statistics change over time: we spend a long time coding with a model that's a poor fit, trying to "unlearn" the previous stats."
    2. For the second, better approach, with the probability updated at an adaptation rate: "A better approach for data with statistics that change over time is to gradually "forget" old stats. The resulting models are much quicker to respond to changed input characteristics, which is usually a decent win in practice."
    To observe the difference we tried plotting the probability graph and could not see the difference they are talking about, as the graph seems to be similar in both cases (saturation occurs in both cases). Please find the graphs for both cases attached.
    Regards, Hariom
    33 replies | 30695 view(s)
  • Jarek's Avatar
    13th June 2019, 16:31
    Jarek replied to a thread Crypto-Mining in Data Compression
    ... and we have to breathe this burnt coal, even worse from Poland ... http://www.buildup.eu/en/learn/tools/electricity-map-live-co2-emissions-european-electricity-consumption
    8 replies | 350 view(s)
  • Sportman's Avatar
    13th June 2019, 16:02
    Sportman replied to a thread Crypto-Mining in Data Compression
    Euro area electricity prices by type of user and country: https://ec.europa.eu/eurostat/tgm/table.do?tab=table&init=1&language=en&pcode=ten00117&plugin=1 Ukraine 0.0393 euro per kWh.
    8 replies | 350 view(s)
  • xezz's Avatar
    13th June 2019, 15:27
    xezz replied to a thread phafre in Data Compression
    If no argument is given, the usage is printed.
    6 replies | 2866 view(s)
  • zyzzle's Avatar
    13th June 2019, 15:24
    zyzzle replied to a thread Crypto-Mining in Data Compression
    I thought Germany had relatively cheap power. 0.27 euro per kWh is expensive, indeed. But where I live, power costs are outrageous: $0.52 per kWh. Welcome to sunny California, USA! I'll not be mining any cryptocurrencies any time soon; the greedy power company sees to that impossibility.
    8 replies | 350 view(s)
  • Stephan Busch's Avatar
    13th June 2019, 11:44
    I had a payout to my bitcoin wallet, so it doesn't seem to be a scam in this regard. Energy prices are horrible here - about 0.27 € per kWh.
    8 replies | 350 view(s)
  • Sportman's Avatar
    13th June 2019, 11:21
    Sportman replied to a thread Crypto-Mining in Data Compression
    To create that affiliate link you probably gave them access to your social media data. Most cryptocurrencies and exchanges are probably illegal, and if not now, then in January 2020 when AML (anti money laundering) laws are fully active. The crypto industry must comply with AML and KYC (know your customer) laws, and also local SEC (securities and exchange commission) laws, which makes most ICOs (initial coin offerings) illegal. Coins and services that have anonymous options cannot comply, like Bitcoin core (lightning), Bitcoin Cash, Monero, Zcash, PIVX, Komodo, Zcoin, NAV coin, Horizen, Verge, Dash, mixers etc. You must pay tax on gained crypto value. Mining is mostly not profitable - only with special ASICs, and sometimes with GPUs when energy prices are around $0.05/euro or less per kWh. If you still want to mine for fun on a CPU, some (CPU mining) coins:
    Yenten (YTN) http://yentencoin.info/
    Electroneum (ETN) https://electroneum.com/
    Magi (XMG) https://www.m-core.org/
    Monero (XMR) https://web.getmonero.org/
    For Monero there are already ASIC miners; they are going to change the algorithm to RandomX as a last try: https://www.prnewswire.com/news-releases/monero-and-arweave-to-validate-the-proof-of-work-algorithm-randomx-300861697.html
    CPU mining software: https://github.com/JayDDee/cpuminer-opt
    Mine via a pool.
    8 replies | 350 view(s)
  • schnaader's Avatar
    13th June 2019, 09:02
    schnaader replied to a thread Crypto-Mining in Data Compression
    As a matter of fact, mining bitcoins nowadays (without specialized hardware) is too expensive; the money you earn with it will only be a fraction of your power bill. If you want to invest in bitcoins, don't put effort into mining - just exchange some real money for bitcoins, monitor the exchange rate and sell them later. Though even this is questionable, since the big rate jumps seem to be over. The specific browser you found looks like a scam to me; I would recommend staying away from it. Also, mining bitcoins with a smartphone is absurd and absolutely not worth it.
    8 replies | 350 view(s)
  • Stephan Busch's Avatar
    13th June 2019, 07:01
    This one seems to be off-topic, but I have found a browser which you can use to earn some money by mining bitcoins - it also seems to run on my Android phone: https://get.cryptobrowser.site/6936209 What do you think about mining?
    8 replies | 350 view(s)
  • Sportman's Avatar
    13th June 2019, 03:15
    I unexpectedly lost control of the domain; because of that, the krc remote cloud option does not work anymore either. Here is a copy:
    48 replies | 22611 view(s)
  • Gotty's Avatar
    13th June 2019, 00:26
    Gotty replied to a thread paq8px in Data Compression
    - Modified Mixer, APM1, StateMap, APM, RunContextMap, SmallStationaryContextMap, StationaryMap, IndirectMap (with unchanged behavior); they are now independent of the global context
    - Modified most of the Model classes - small steps toward clarity and independence from the global context
    - Refactored Training (NormalModel, ExeModel)
    - Refactored APM and StateMap; StateMap now handles context sets (it processed single contexts until now)
    - Scaling of stretched probabilities is now performed by *>> instead of */ in the maps
    - Fixed an Image8bitModel context (thank you Márcio)
    - Re-added the distancemodel removed in v179fix2
    - Other minor/cosmetic changes; fixed some method and variable names to follow naming conventions
    - Blended earlier-removed changelogs from the cpp into CHANGELOG
    Some of the changes are not final; this is still a transient version. Again - I just needed to commit before there are too many changes and it becomes difficult to diff-view. As a side effect of the above changes, this version is already up to 12.0% faster than v179fix1. But in compression it should be very similar to v179fix1.
    1615 replies | 462920 view(s)
  • fab's Avatar
    12th June 2019, 21:48
    You can use a set of learning rates increasing geometrically and aligned on powers of 10, such as 10^(n/4) where n = -16 ... -8.
    65 replies | 5080 view(s)
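    For illustration, a sketch enumerating that grid: 10^(n/4) for n = -16..-8 runs geometrically from 1e-4 to 1e-2 and hits an exact power of 10 every fourth step:
    #include <stdio.h>
    #include <math.h>
    int main(void) {
        // Learning-rate set 10^(n/4), n = -16..-8.
        for (int n = -16; n <= -8; n++)
            printf("n = %3d: %g\n", n, pow(10.0, n / 4.0));
        return 0;
    }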
  • fab's Avatar
    12th June 2019, 21:27
    I confirm NNCP uses technique #2. I did not experiment with the other techniques. For example, it may be interesting to test a backpropagation every time_steps/2 instead of every time_steps (an overlap of time_steps/2 symbols). There are also several possibilities regarding the contribution of the overlapped symbols to the loss function. I don't see another way to implement batches (they are just independent streams of symbols used in a single backpropagation step). However, there may be models giving higher parallelism with a smaller batch size. For example, the Transformer model gives more parallelism than LSTM, but unfortunately that is only true on the encoding side.
    65 replies | 5080 view(s)
  • CompressMaster's Avatar
    12th June 2019, 20:45
    tefara, could you post your software, or at least a compressed sample?
    55 replies | 13582 view(s)
  • moisesmcardona's Avatar
    12th June 2019, 19:56
    moisesmcardona replied to a thread xeloz in Data Compression
    It is probably compiled as a dynamically linked executable rather than a static one.
    37 replies | 17203 view(s)
  • CompressMaster's Avatar
    12th June 2019, 19:40
    CompressMaster replied to a thread Durilca Gui in Download Area
    2pact, could you post the files as attachments if you have them? Thanks.
    4 replies | 6725 view(s)
  • CompressMaster's Avatar
    12th June 2019, 18:51
    Sportman, none of the links work. It would be better to post the files directly as attachments instead of external links. Could you fix that? Thanks.
    48 replies | 22611 view(s)
  • CompressMaster's Avatar
    12th June 2019, 18:30
    CompressMaster replied to a thread xeloz in Data Compression
    Two DLLs are missing - libgcc_s_dw2-1.dll and libstdc++-6.dll. Also, the program crashed with these options (maximal compression):
    xeloz c889 video.txt video.xeloz
    video.txt is an mp4 video converted to hexadecimal.
    37 replies | 17203 view(s)
  • CompressMaster's Avatar
    12th June 2019, 18:12
    CompressMaster replied to a thread phafre in Data Compression
    I've tested phafre 0.3 with the maximal possible compression settings to compress a 93 MB hexadecimal testfile, but the ratio is only 53%. Are there any plans for a new version?
    6 replies | 2866 view(s)
  • CompressMaster's Avatar
    12th June 2019, 17:43
    Thanks Shelwien! Now it works as expected. I had previously used a binary-to-hexadecimal (and vice versa) CMD converter, but it did not work properly because I specified the input in hexadecimal instead of binary. My mistake, sorry.
    6 replies | 217 view(s)
  • xezz's Avatar
    12th June 2019, 15:46
    xezz replied to a thread Alba in Data Compression
    My best result: alba e4096d512k:e2 jacmen_hex.txt - out size: 1962419 to 1082458 (55.159372%). This result is better than Re-Pair, so my answer is: impossible!
    22 replies | 11480 view(s)
  • Shelwien's Avatar
    12th June 2019, 14:03
    > (1) In LZMA the estimated probability is presented as 11-bit unsigned integer value.
    > What are the factors that we need to see when we are going ahead with this value?
    a) The practical probability scale is within 1<<10..1<<16 - less than 10 bits is not enough to hold data statistics (worse compression); more than 16 bits is much slower to handle but doesn't improve compression anymore.
    b) 11 bits is actually also too low, but there's an additional speed boost from being able to handle the rc range update with a single if() rather than several (or a loop). SCALE/minprob < (1<<8), and with the p-=p>>5 update minprob is 31, so (1<<x) < 256*31, x < 13. Maybe Igor rounded 31 down to 1<<4 and thus got x=11.
    > (2) In LZMA they use total 2048 symbols, 0 and 1 symbols having initial probability equal to 0.5,
    These are probability values; there are still only 256 symbols (literals), while encoding is bitwise.
    > The calculation is different for the symbol 0 and symbol 1.
    > I could not find the reason why they are different.
    We have the probability p of bit==0. It has to be increased if the bit actually was 0, and decreased otherwise. "bound" is the lower bound of the bit==1 range, so code<bound means bit==0.
    > (3) In this calculation they use a division factor 32,
    > I could not get the reason behind this as they are using 11 bit probability model.
    p-=p>>5; means the same as p=p*(31/32); it should be possible to find a coefficient with slightly better compression - 1<<5 is chosen as a trade-off between compression and speed optimization. Also:
    p+=(SCALE-p)>>5;  ==  p=p+SCALE/32-p/32;  ==  p=p*(31/32)+SCALE/32;  // bit==0: increase the probability
    p-=p>>5;          ==                          p=p*(31/32)+0;        // bit==1: decrease the probability
    This exponential weighting of data bits to produce a probability estimate is very important for compression of nonstationary data.
    > (4) And for normalization they have following code, they compare the range with the (1<<24) value.
    The full range scale is 32 bits. We want to work with bytewise i/o (which is the main difference between a rangecoder and an AC), so to add a byte we need the range to go below 32-8=24 bits.
    I'd suggest reading these:
    http://ctxmodel.net/rem-37.html "Introduction to Rangecoding"
    https://fgiesen.wordpress.com/2015/05/26/models-for-adaptive-arithmetic-coding/
    33 replies | 30695 view(s)
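    A minimal sketch tying points (1)-(4) together (assuming the LZMA conventions described above: 11-bit probabilities, shift-5 update, single-if bytewise renormalization):
    #include <stdint.h>
    #define kNumBitModelBits 11
    #define kBitModelTotal   (1u << kNumBitModelBits)   /* 2048 */
    #define kTopValue        (1u << 24)
    typedef struct { uint32_t range, code; const uint8_t *in; } RangeDec;
    static int DecodeBit(RangeDec *rc, uint16_t *prob) {
        /* bound = lower bound of the bit==1 subrange; code < bound means bit==0 */
        uint32_t bound = (rc->range >> kNumBitModelBits) * (*prob);
        int bit;
        if (rc->code < bound) {
            rc->range = bound;
            *prob += (kBitModelTotal - *prob) >> 5;   /* raise P(bit==0) */
            bit = 0;
        } else {
            rc->code  -= bound;
            rc->range -= bound;
            *prob -= *prob >> 5;                      /* lower P(bit==0) */
            bit = 1;
        }
        if (rc->range < kTopValue) {                  /* bytewise renorm */
            rc->range <<= 8;
            rc->code = (rc->code << 8) | *rc->in++;
        }
        return bit;
    }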
  • Hariom's Avatar
    12th June 2019, 09:20
    Hello Shelwien, I came to know about the LZMA algorithm, which uses range coding for encoding and decoding of the data. I looked at the implementation and I have some doubts about it. Can you please help me get them resolved?
    (1) In LZMA the estimated probability is represented as an 11-bit unsigned integer value for the probability of symbol "0", in place of a 32-bit integer. I can't see how this value was chosen. What are the factors that we need to consider when going with this value?
    (2) In LZMA they use a total of 2048 symbols, the 0 and 1 symbols having initial probability equal to 0.5. But as they encode a symbol, they update the probability with some calculated value based on the code value. The calculation is different for symbol 0 and symbol 1. I could not find the reason why they are different. For more details please see the attached image LzmaRangeDeCoder.jpg.
    (3) In this calculation they use a division factor of 32. I could not get the reason behind this, as they are using an 11-bit probability model. The image DivisionFactor.jpg can be seen for more details.
    (4) And for normalization they have the following code, where they compare the range with the (1<<24) value. Again, I am not able to get the reason behind this.
    #define kTopValue ((UInt32)1 << 24)
    void CRangeDecoder::Normalize()
    {
      if (Range < kTopValue)
      {
        Range <<= 8;
        Code = (Code << 8) | InStream->ReadByte();
      }
    }
    I request you to please help me understand these points.
    Regards, Hariom
    33 replies | 30695 view(s)
  • CompressMaster's Avatar
    11th June 2019, 19:36
    CompressMaster replied to a thread Alba in Data Compression
    I've tried brute force (best compression). Also, I've specified the maximal blocksize manually, but it hurts the compression ratio.
    22 replies | 11480 view(s)
  • Jarek's Avatar
    11th June 2019, 19:18
    I was trying to find something similar in the literature, discussing it and asking people if they have seen something like it, especially the gx average ... but nothing. So it seems it is a novel approach; the question is whether it is promising. It gives cheap online estimation of the Hessian from the gradient sequence alone, optimal in the MSE sense with exponentially weakening weights for the old gradients - which seems to be exactly what is needed; I don't know any approach even close to its possibilities (?) The closest is quasi-Newton, but an exponential moving average seems a much better way to extract statistics than estimating the Hessian from a few recent noisy gradients. I know it needs some serious testing, but this is only a general template with lots of details to choose, and I am a lone theoretician - learning Python and testing it is one of my plans for the summer break. Anyway, I think it is at least worth some deeper investigation (?); I would gladly discuss it and look for collaboration ...
    21 replies | 622 view(s)
  • Stefan Atev's Avatar
    11th June 2019, 18:51
    I realize I am waving my hands in the air quite a lot here - BB is just the only method I know of that picks a step size for gradient descent that is not obviously a line search or a Newton-style method :) I have not seen a method that looks at the trend of the gradients to decide on a step size, but my experience is in more traditional (non-stochastic) optimization, where quasi-Newton/CG methods work quite well even if the problem is not quasi-convex, so I have had the luxury of not needing such trickery... Unfortunately trying to just do a search for "adaptive learning rate" is a very fraught exercise, there's way too much published, and a lot of it of unknown quality.
    21 replies | 622 view(s)
  • Jarek's Avatar
    11th June 2019, 17:49
    Thanks, interesting - at first look it seems quite far from this, but I will look closer. To get the linear trend of gradients from least-squares linear regression we need to update 4 (exponential moving) averages (of g, x, xg, xx) - can you think of some paper using such averages? Momentum uses the average of g only, TONGA of gg^T (the uncentered covariance matrix), but I haven't met another optimizer exploiting this linear trend in particular: the xg average ...?
    21 replies | 622 view(s)
  • Stefan Atev's Avatar
    11th June 2019, 17:23
    I am not sure I have seen this; The closest analog I can think of is the Barzilai-Borwein method, which people have already tried with SGD - it is a way to get step size without search. I personally have no experience with it. That gets you into non-monotonic methods (where individual iterations may make the cost or the norm of the gradient grow). (Edit: Here is a paper from very legit researchers on it: https://arxiv.org/pdf/1605.04131.pdf; again, I can only recommend it based on the authors, have not looked into it carefully).
    21 replies | 622 view(s)
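    For reference, a sketch of the Barzilai-Borwein step mentioned above (s and y follow the usual quasi-Newton notation; the dimension D is illustrative):
    #define D 3
    static double dot(const double *a, const double *b) {
        double s = 0;
        for (int i = 0; i < D; i++) s += a[i] * b[i];
        return s;
    }
    /* Step size from the last two iterates and gradients:
       s = x_k - x_{k-1}, y = g_k - g_{k-1}.
       Returns the "BB2" step (s.y)/(y.y); the "BB1" variant is (s.s)/(s.y). */
    static double bb_step(const double *x, const double *xp,
                          const double *g, const double *gp) {
        double s[D], y[D];
        for (int i = 0; i < D; i++) { s[i] = x[i] - xp[i]; y[i] = g[i] - gp[i]; }
        return dot(s, y) / dot(y, y);
    }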
  • Jarek's Avatar
    11th June 2019, 16:54
    ADAM has a few heuristics, but it does not try to evaluate the distance to the minimum, which requires e.g. a 2nd-order model: a parabola in the considered direction. We can get that practically for free from the linear trend of gradients in the considered direction (e.g. by updating 4 exponential moving averages) - ask where this linear trend intersects 0. I wasn't able to find such an approach in the literature - have you seen something like this?
    21 replies | 622 view(s)
  • dnd's Avatar
    11th June 2019, 16:48
    - AV1 Ecosystem Update: May 2019
    - Amphion Semiconductor introduces 4K/UHD-capable AV1 video decoder hardware
    - Android Q AV1 support
    - Discussion on Hacker News
    57 replies | 7738 view(s)
  • Stefan Atev's Avatar
    11th June 2019, 16:41
    Any learning rate control is "analogous" to line search - it serves a similar purpose; I did not say it is the same. https://arxiv.org/pdf/1902.09843.pdf shows some ways to improve ADAM and it explicitly talks about correcting ADAM's overzealous learning rate. A line search is just one method of step size control, which may be too expensive or a wash to be worth it (eating a few extra function/gradient evaluations to do the search may be just as expensive as just doing a few more iterations). Using momentum is one way of saying that your past update may have been too small (though not the full story either). Anyway, the literature on optimization is too vast, and I am not sure that's the forum to discuss it in. My professional experience with optimization has always led to (ab)using the unique structure of a problem to specialize a general purpose algorithm almost beyond recognition; it may very well turn out that the same is true for training neural networks - different architectures / cost functions may turn out to "prefer" different optimization algorithms, and that's what we call "job security" in the industry :)
    21 replies | 622 view(s)
  • dnd's Avatar
    11th June 2019, 15:57
    Yes, but it's known that the 2600K is easily overclockable with standard CPU air cooling. The speed of memory is important for compression tasks, and here the i7-7700K is using faster DDR4 against the slower DDR3 of the i7-2600K.
    AMD Zen 2 Microarchitecture Analysis: Ryzen 3000 and EPYC Rome
    But, actually, no 7-zip or other compression benchmarks.
    57 replies | 7738 view(s)
  • Jarek's Avatar
    11th June 2019, 15:50
    ADAM is momentum + strengthening underrepresented coordinates + maintaining inertia + bias correction for initializing the exponential moving averages ... nothing like line search. It doesn't try to model the distance to the minimum in the considered direction - which is contained in the linear trend of the gradients and can be cheaply taken into consideration - to increase steps on a plateau and decrease them in a steep valley. Adding an online parabola model in the considered direction goes toward line search ... without its additional cost, if the parabola is extracted from the linear trend of gradients.
    21 replies | 622 view(s)
  • Stefan Atev's Avatar
    11th June 2019, 15:37
    I think ultimately you need to be able to answer why you think second order information would help you in a situation where you're operating away from the optimal solution most of the time. It may be over-simplifying things, but ADAM is also successful because of learning rate control, which is somewhat analogous to line search - making the most out of a descent direction faster (not taking steps that are too small). You do want individual batch updates to move you in the direction of the gradient (of the current batch) and likely orthogonal to previous gradients (to minimize how much you disturb the solution wrt previous batches). And that starts to look like CG :)
    21 replies | 622 view(s)
  • xezz's Avatar
    11th June 2019, 14:37
    xezz replied to a thread Alba in Data Compression
    What is your setting? And please attach testfile.
    22 replies | 11480 view(s)
  • Jarek's Avatar
    11th June 2019, 14:15
    You could probably maintain the minimum while continuously transforming a parameter - e.g. just interlace tiny parameter steps with gradient descent steps (or use some implicit-function optimization) ... The optimization often has many (hyper-)parameters of its own, starting with the step size ... ideally they should be adapted along the optimization process ... maybe such optimization scheduling could be translated between problems ... Regarding random init, it is said to be good to recreate the original statistics - probability distributions of individual weights, maybe joint (I have a tool for joint: https://www.dropbox.com/s/7u6f2zpreph6j8o/rapid.pdf ) ... but a previous result of optimization is probably a better initial point, maybe somehow perturbed. If by GA you mean gradient ascent (of likelihood), it is practically the same - just switch the sign. Regarding such parametric models, I also started with just polynomials and it was disappointing ... but then I added optimized powers, e.g. |C-B|^0.8, |A-0.5|^4, |C-2B+D|^0.1 in https://arxiv.org/pdf/1906.03238, and the performance improved a lot. Generally, such a power function would be put into a table, which allows further optimizing such tabled functions based on a dataset ...
    21 replies | 622 view(s)
  • Shelwien's Avatar
    11th June 2019, 13:36
    > often optimum found for one problem is used as the initial point of optimizing for another problem.
    Yes, but that's too simple. I'm talking about something like observing that the optimal values for some parameter are always in a specific range... just with statistics. But tracking the optimal parameter values is not the only option - usually the optimization process itself also has some constants/parameters.
    > there is a heuristics that initial weights should be chosen randomly, accordingly to statistics of a given type of NN ...
    I try it sometimes - these days I'm always running parameter optimization for some entropy models, so there are many options for experimenting. My current conclusion on random init is that it never works for me - the optimizer always gets stuck at much worse results than what it finds starting from previous profiles, which evolved during model development. Of course, my optimizer is very far from gradient descent; it's more like GA, probably. Actually, I did try to convert the entropy model to a function: https://encode.ru/threads/1522-Parametric-approximation-of-entropy(rate)-function But it mostly failed, as it wasn't precise enough, or fast enough.
    21 replies | 622 view(s)
  • Jarek's Avatar
    11th June 2019, 13:16
    Shelwien, a 5-part answer ;-)
    1) SGD requires statistics extraction, as we assume the gradients are noisy.
    2) This "optimal parameters of the same function, just with different input" sounds like context-dependent parametric models - see the neighboring thread: https://encode.ru/threads/3124-LOCO-I-(JPEG-LS)-uses-365-parameters-to-choose-width-any-alternatves
    3) Regarding "observe the optimization process and improve it based on stats": often the optimum found for one problem is used as the initial point when optimizing another problem, for example in computer vision to train for a slightly different type of images.
    4) However, exploiting an old optimization path for a new problem seems very problematic - the minimum of the loss function is often not a single point but huge, due to: symmetry (permuting parameters often gives an analogous solution), and the over-parametrization used in deep learning (# parameters >> data size, with regularization to focus on somehow smooth/reasonable models), which makes minima have mostly zero eigenvalues - so a minimum is practically a submanifold of nearly full dimension - it seems tough to extract useful statistical trends from old paths.
    5) However, there is a heuristic that the initial weights should be chosen randomly - according to the statistics of the given type of NN ...
    21 replies | 622 view(s)
  • Shelwien's Avatar
    11th June 2019, 12:54
    I always wondered if it's possible to use statistics to improve optimizer convergence. I mean, quite frequently we have to find optimal parameters of the same function, just with different input, so it should be possible to observe the optimization process and improve it based on stats?
    21 replies | 622 view(s)
  • Jarek's Avatar
    11th June 2019, 12:07
    Generally the landscape in ML problems is extremely complex. A standard "stop and estimate the Hessian" approach requires maintaining a single Hessian model for a longer time. In contrast, inexpensive online estimation, especially based only on the stochastic gradient sequence, allows better dependence on the local situation.
    In how many dimensions would a 2nd-order model be most practical?
    - In a single direction (d=1): we get a momentum-like method with additional online modelling of a parabola in its single direction, allowing a smarter step size, e.g. half the distance to the minimum of the modeled parabola.
    - A complete Hessian (d=D) is completely impractical for ML dimensions in the millions. Additionally, most of its eigenvalues are close to zero due to over-parametrization, making Hessian inversion a huge problem.
    - I think d~10 is a reasonable compromise: it adds a multidimensional model and simultaneous optimization to d=1 at a comparable computational cost. Choosing these directions by some online PCA of recent gradients, they point to where the action really is: the sparse directions with curvature far from zero.
    So we could update (e.g. by online PCA) a nearly orthogonal basis (v_i) of size d~10, use the projections of the gradient on this basis to update the model of the Hessian there, and use what is left of the gradient (the (D-d)-dimensional projection) for simultaneous 1st-order optimization like gradient descent or something more sophisticated.
    For (online) Hessian estimation, I think linear regression of gradients is promising. In 1D a parabola has gradient (derivative) g = lambda*(x - p); we can get lambda from linear regression - optimizing the least-squares error for the (noisy!) gradients - using four averages, of g, x, gx and xx:
    lambda = (<gx> - <g><x>) / (<xx> - <x>^2)
    For the non-parabola case, we can replace the averages with exponential moving averages, weakening the weights of old noisy gradients. In multiple dimensions we analogously update 4 exponential moving averages, but (like in momentum) <g> and <x> are vectors, and <gx>, <xx> are matrices. Now the linear-regression Hessian estimate comes from such covariance matrices:
    H = (<gx> - <g><x>^T) (<xx> - <x><x>^T)^-1
    Have you seen something like this in the literature? Being "better" means mainly faster training, which is extremely important especially for deep NNs ... but also not being attracted to saddles and not stagnating on suboptimal plateaus - e.g. ADAM is known to often lead to a suboptimal solution, and the saddle-free Newton paper shows that other methods lead to "minima" with lots of strong negative eigenvalues ... Generalization is a separate problem, but a related one.
    21 replies | 622 view(s)
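    A 1D sketch of the estimator described above (names and the decay rate beta are illustrative): the curvature lambda of g = lambda*(x - p) is obtained by linear regression of noisy gradients via four exponential moving averages:
    typedef struct { double g, x, gx, xx; } Averages;
    static void ema_update(Averages *a, double x, double g, double beta) {
        /* exponential moving averages of g, x, g*x and x*x */
        a->g  = beta * a->g  + (1 - beta) * g;
        a->x  = beta * a->x  + (1 - beta) * x;
        a->gx = beta * a->gx + (1 - beta) * g * x;
        a->xx = beta * a->xx + (1 - beta) * x * x;
    }
    static double curvature(const Averages *a) {
        /* lambda = (<gx> - <g><x>) / (<xx> - <x>^2) */
        return (a->gx - a->g * a->x) / (a->xx - a->x * a->x);
    }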
  • Stefan Atev's Avatar
    11th June 2019, 02:10
    I am not quite sure whether you're proposing this in the limited (d << D) case, where estimating the full Hessian is possible? In the D-dimensions, you can only afford an implicit Hessian (like CG, UKF), low-rank Hessian or inverse Hessian (limited memory quasi-Newton), or structured Hessian (if diagonal counts as "structure"; maybe for certain networks I can imagine a reasonable sparse Hessian). It seems that if you've already projected to a d-dimensional space, a lot of the noise will also be projected out, so maybe you don't need additional strategies to smooth the Hessian further. I could find several references on "stochastic" quasi-Newton, but I think none of them map directly to what you're trying to do - each was a specialized solution. The success of SGD and the importance of neural networks has not escaped notice in the numerical optimization circles, but sometimes the language is hard to translate between fields. In your opinion, what will be the benefit of "better" optimization methods - better generalization performance, or just better computational performance? For pure machine learning 1 is ultimately more important than 2, and I think the fact that sometimes more advanced optimization can lead to worse generalization performance has been a very frustrating realization for the deep network folks...
    21 replies | 622 view(s)
  • dado023's Avatar
    10th June 2019, 23:22
    It's not a tweet, but it might be interesting to some forum members: https://community.centminmod.com/threads/round-3-compression-comparison-benchmarks-zstd-vs-brotli-vs-pigz-vs-bzip2-vs-xz-etc.17259/
    57 replies | 7738 view(s)
  • Shelwien's Avatar
    10th June 2019, 22:30
    Maybe this: https://github.com/m7821/Bin2Hex/releases https://sourceforge.net/projects/bin2hex/files/v.%200.1b/
    6 replies | 217 view(s)
  • Stefan Atev's Avatar
    10th June 2019, 22:27
    Yes, in a stochastic setting I have never seen anybody use CG. I personally think a good CG is on par with L-BFGS when using just a few vectors. I have used CG for non-convex problems, with the understanding that the local minimum you end up in is the one you get - I don't think any method can give you convergence guarantees on non-convex problems in general (except maybe in an "expected" sense). I think one of the papers I linked to basically links EKF methods with quasi-Newton; the UKF paper shows two ways of putting the network-training problem into a state-space setting.
    21 replies | 622 view(s)
  • Jarek's Avatar
    10th June 2019, 21:56
    Ok, I wrongly imagined that the context is a suffix of the encoded bit sequence, but if it works for LOCO-I, it means the context sequence and the encoded sequence can be separate. Regarding the segmentation philosophy, it assumes there is a classifier splitting an image into similar segments (e.g. sky, plants etc.) - for each of them we can use a separate model for the given type of pattern, as experimentally there are large differences between models optimized for different images.
    24 replies | 901 view(s)
  • Jarek's Avatar
    10th June 2019, 21:44
    I have no experience with Kalman filters - I will have to take a closer look. Stefan, both quasi-Newton (L-BFGS, SR1, DFP ...) and CG assume convexity - they are attracted to saddles, and there are often exp(dim) of them in machine learning - isn't that a problem? CG seems unpopular in the machine learning literature (?), but L-BFGS is quite common. However, besides the saddle-attraction problem, it tries to estimate the inverted Hessian based on the ~10 most recent stochastic gradients, which are very noisy - it seems numerically very unstable, which is also suggested by its requirement of line search (also CG). Here are some of my thoughts on how to get successful 2nd-order methods (https://arxiv.org/pdf/1901.11457), which I would gladly discuss:
    - perform 2nd-order modelling only in a few (d << D) dimensional subspace - chosen e.g. by online PCA of recent gradients (suggesting where the action locally is). Additionally, we can still use the remaining (D-d) directions of the gradients to simultaneously perform e.g. gradient descent for free there - combining methods of multiple orders,
    - control the signs of the curvatures to repel from saddles instead of being attracted to them, which gives the significant improvements shown by saddle-free Newton: https://arxiv.org/pdf/1406.2572
    - the Hessian should be estimated online to have better local dependence, preferably from the gradient sequence alone. This can be done e.g. by linear regression of gradients (as the Hessian is their linear trend), by updating four exponential moving averages: of g, x, gx, xx. Sure, quasi-Newton can also estimate from gradients, but from only a few - making it numerically unstable. Exponential moving averages and linear regression should be much better for extracting statistical trends from noisy gradients.
    21 replies | 622 view(s)
  • thorfdbg's Avatar
    10th June 2019, 21:44
    Well, of course not. The adaptation mechanism only estimates conditional probabilities. The context captures such dependencies, and defines what the probabilities are conditioned on. For JPEG-LS, we have contexts that go across lines, and so do the EBCOT contexts - except that the latter has a more complex scanning pattern. Sorry, what is a segment? A block looks like a segment to me - or a potential segment. Not that this is ideal, since it means the segment bounds are static, not adaptive.
    24 replies | 901 view(s)