NNCP: Lossless Data Compression with Neural Networks
Similar to CMIX or the many paq variants, but it seems to be simpler and more efficient:
I managed to compile it on Windows, but there seem to be issues with __fastcall calling-convention differences
(the full source is not available - there are binary libs compiled on Linux).
But at least the preprocessor should work - it does something interesting too.
Did anybody test lstm-compress with DRT-processed enwik?
Code:
                 enwik8      enwik9
NNCP             16,924,569  128,292,351
CMIX (v17)       14,877,373  116,394,271
durilca          16,209,167  127,377,411
lstm-compress v3 20,318,653  173,874,407
The NNCP paper is great! It contains a bunch of interesting research:
- There are several differences in the LSTM architecture NNCP uses compared to cmix/lstm-compress. I will try experimenting with using some of the ideas in cmix.
- The compression speed is really impressive: 10x faster than lstm-compress with comparable compression rate. It is unfortunate that the LibNC library is closed source. I tried hard to optimize the LSTM speed in cmix. With faster speed, the model size can be increased (which improves compression rate).
- It is interesting to see results from a transformer model. I am disappointed to see that it performs worse than LSTM! Despite the worse performance, I think it is still worth integrating into paq8/cmix. Since the architecture is so different from LSTM, the predictions from the two models would combine well together.
- I am surprised to see that the large models didn't use the DRT preprocessing from phda/cmix. I am planning to investigate the NNCP preprocessor - hopefully some of the ideas there can be used to improve the cmix preprocessor.
I'm still trying to run nncp on windows, by adding sysv_abi wrappers for functions used by binary library.
It has mostly stopped crashing outright, but now it creates threads in an infinite loop instead.
> The compression speed is really impressive
It seems to contain multithreading and vector optimizations.
I think he's trying to use this project to advertise his NN library, so seeing the source is unlikely.
> I am surprised to see that the large models didn't use the DRT preprocessing from phda/cmix
His preprocessor performs some kind of iterative dictionary optimization.
I tried running it like this: preprocess.exe c lpqdict0.dic enwik_text2 enwik_text2_nnt 44814 1
It took quite a while, but it unexpectedly overwrote the dictionary file and tried to save something to /tmp/word1.txt, which failed because that path doesn't exist on Windows.
Kinda works, but there are memory leaks (it crashes if I do anything in free()) and I suspect it might require a CPU with AVX512 :)
At least it has a line "Your CPU does not support AVX2+FMA - exiting".
Btw, it seems to have problems with entropy coding? (see attached picture)
cdm can compress my test output from 1597 to 1571.
Can somebody reproduce the 16.9M result and upload the compressed file somewhere?
I am having trouble compiling it in Ubuntu:
I tried recompiling with -fPIC, but got the same error message. Let me know if someone figures out how to compile in Linux.
Code:
gcc -o nncp nncp.o cp_utils.o arith.o libnc.a -lm -lpthread
/usr/bin/ld: libnc.a(libnc.o): relocation R_X86_64_32S against symbol `tab_mask32' can not be used when making a PIE object; recompile with -fPIC
/usr/bin/ld: libnc.a(job.o): relocation R_X86_64_32 against `.text' can not be used when making a PIE object; recompile with -fPIC
/usr/bin/ld: final link failed: Nonrepresentable section on output
collect2: error: ld returned 1 exit status
Makefile:38: recipe for target 'nncp' failed
nncp.pdf, page 3: NNCP small uses 3 layers x 90 cells like lstm-compress, but it needs 12.1 M, versus lstm-compress's 6864 KB / 6452 K (depending on which computer I use), to compress ENWIK8 (the LTCB page says 9 MB).
> Kinda works, but there are memory leaks (it crashes if I do anything in free()) and I suspect it might require a CPU with AVX512 :)
> At least it has a line "Your CPU does not support AVX2+FMA - exiting".
nncp.pdf, page 5: "On x86 CPUs, the AVX2 + FMA instructions are mandatory in order to get correct performance.".
The only thing of interest here would be the LibNC library, which could help either speed up the LSTM model in cmix a lot, or allow us to run a much higher complexity model with about the same runtime. But since the library is closed-source, and when Fabrice doesn't open-source something it usually means he has a commercial interest in it, I'm not holding my breath.

> Regarding the speed, our results show that our implementation is much faster than lstm-compress with a similar model and gain. The speed improvement comes from the more optimized matrix multiplication implementation and the larger batch size (16 instead of 1).

> On x86 CPUs, the AVX2 + FMA instructions are mandatory in order to get correct performance.

So, at least for now, the massive speedup is only possible on a very small subset of processors.
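For what it's worth, the batch-size effect is easy to demonstrate outside NNCP. Here is a toy numpy sketch (shapes borrowed from the nncp defaults; this is an illustration of batched matmul, not LibNC's actual code) of why processing 16 streams at once turns sixteen matrix-vector products into one matrix-matrix product, which vectorizes and caches much better:

```python
import numpy as np

# Hypothetical sizes matching the nncp defaults (hidden_size=352, batch_size=16).
hidden_size, batch_size = 352, 16
rng = np.random.default_rng(0)
W = rng.standard_normal((hidden_size, hidden_size)).astype(np.float32)
H = rng.standard_normal((hidden_size, batch_size)).astype(np.float32)  # one hidden state per stream

# batch_size=1 style: sixteen separate matrix-vector products
sep = np.stack([W @ H[:, i] for i in range(batch_size)], axis=1)

# batch_size=16 style: one matrix-matrix product over all streams at once
batched = W @ H

assert np.allclose(sep, batched, atol=1e-4)
```

The larger the batch, the more each loaded row of the weight matrix is reused, which is presumably where much of the speedup comes from (at the cost of splitting the input into parallel streams).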
The most exciting outcome from that thread for me is that Eugene managed to use Linux .so library in Windows executable
Man, Fabrice is some kind of polymath genius. He never stops creating.
Compiled on debian without any problems, ELF binaries here: http://nishi.dreamhosters.com/u/nncp1.rar
As to compiling problems, it might help to
1) extract .o files from libnc.a (with 7-zip)
2) process them with agner objconv: https://www.agner.org/optimize/objconv.zip
objconv.exe -fcoff64 libnc.o libnc_c.o
objconv.exe -ew2036 -fcoff64 libnc.o libnc_c.o
Since 2019/04/13 there is an official version for Windows, it needs a CPU that supports AVX2+FMA:
There's also a .lib file in https://bellard.org/nncp/nncp-2019-04-13.tar.gz
But save_words_debug(s, "/tmp/word1.txt", char_freq, buf_size); still remains, so the preprocessor doesn't save its output, since /tmp doesn't exist on Windows.
And the test script doesn't work properly either (the "all" mode doesn't preprocess and then compress the file).
Got 16,984,458 with my script. Maybe he started with cmix -s output? Or does the result in the paper need -T 1?
Code:
set file=enwik8
set ThN=4
.\preprocess c _tmp_word.txt %file% _tmp_out.bin 16384 64
set large=-n_layer 5 -hidden_size 352 -n_symb 16388 -full_connect 1 -lr 6e-3
.\nncp -T %ThN% %large% c %file% _tmp_out-nncp.bin
.\nncp -T %ThN% %large% d _tmp_out-nncp.bin _tmp_out-nncp.txt
Well, at least it's not compressible - I guess that's only a problem with the default options.
As to AVX512, there's https://software.intel.com/system/fi...11-win.tar.bz2
You can run
Code:
sde.exe -avx512 -- nncp.exe c input output
NNCP 2019/04/13, default options, without preprocessing.
nncp c InputFile OutputFile
cell=LSTM-C n_layer=4 hidden_size=352 batch_size=16 time_steps=20 n_symb=256 ln=1 fc=0 sgd_opt=adam lr=4.000e-003 beta1=0.000000 beta2=0.999900 eps=1.000e-005 n_params=5.28M n_params_nie=3.84M mem=79.1MB (Windows Task Manager reports 85900-88220 K)
Code:
Maximum Compression
   818.840  A10.jpg
 1.189.590  AcroRd32.exe
   465.370  english.dic
 3.671.152  FlashMX.pdf
   653.690  FP.LOG
 1.496.712  MSO97.DLL
   783.988  ohs.doc
   706.281  rafale.bmp
   547.304  vcfiu.hlp
   485.039  world95.txt
10.817.966  Total

ENWIK
        39  ENWIK-1 (size 0)
       139  ENWIK0 (*)
       139  ENWIK1 (*)
       236  ENWIK2
       829  ENWIK3
     4.795  ENWIK4
    34.264  ENWIK5
   260.738  ENWIK6
 2.231.375  ENWIK7

(*) The NNCP decompression halts for all file sizes between 1 and 15 bytes; only once did it write "Assertion failed!", "Program: ...\nncp.exe", "File: libnc.c, Line 4941", "Expression: c >= 0 && c < w->dims".
Some of my tests on my testset, compared to the latest lstm-compress v3 and to CMo7 with the -c7 option.
It looks like lstmc gives 0.7% better results than plain lstm.
There is also a table with a speed comparison according to the number of threads used.
It looks like 3, 4 and 5 threads have similar speed, with 4 threads winning very slightly.
Based on tips from Fabrice Bellard, I've tested MaximumCompression on nncp with the option "-batch_size 8".
Scores are better than with the default setting ("-batch_size 16") - this option gains 100 KB (about 1%). Here is a comparison with Mauro's default-option scores:
1. Did he use cmix -s enwik8 as input for his preprocessor? His enwik8 result is better than what I've got from his test script.
2. Does the number of threads affect the compression?
> 1. Did he use cmix -s enwik8 as input for his preprocessor? His enwik8 result is better than what I've got from his test script.
Looks like he used his own preprocessor.
4.2) Large models
./preprocess c out.words enwik8 out.pre 16384 64
./preprocess c out.words enwik9 out.pre 16384 512
LSTM large model:
./nncp -n_layer 5 -hidden_size 352 -n_symb 16388 -full_connect 1 -lr 6e-3 c out.pre out.bin
From nncp.pdf: see 5.2 Subword based preprocessor
From nncp-2019-04-13.tar.gz: see preprocess.c
Maybe the 2019/04/13 version slightly improves the compression compared to what was reported the first time?
> 2. Does the number of threads affect the compression?
I did a quick test (quad core, 2-way hyperthreading) on ENWIK5:
-T 1 bps=2.741 (34264 bytes), 2.23 kB/s
-T 2 bps=2.741 (34264 bytes), 3.00 kB/s
-T 4 bps=2.741 (34264 bytes), 3.58 kB/s
-T 6 bps=2.741 (34264 bytes), 3.42 kB/s
-T 8 bps=2.741 (34264 bytes), 3.34 kB/s
Compressed files are identical for all -T.
The NNCP decompression halts for all file sizes between 1 and 15 bytes; I already wrote to Fabrice, he said he'll fix it.
pdf says "For the small LSTM model, we reused the text preprocessor of CMIX".
.\preprocess c _tmp_word.txt enwik8 out.bin 16384 64
.\nncp -T 4 -n_layer 5 -hidden_size 352 -n_symb 16388 -full_connect 1 -lr 6e-3 c out.bin out-nncp.bin
like specified in readme file
and got 16,984,458, verified decoding too.
.pdf says result should be "LSTM (large) 16,924,569".
> pdf says "For the small LSTM model, we reused the text preprocessor of CMIX".
Yes, for LSTM small they reused cmix, and PDF says the compressed size is 20500039.
The compressed size 16924569 is for LSTM large (PDF table 3): PDF "5.2 Subword based preprocessor." says "The larger models use a preprocessor where each symbol represents a sequence of bytes. ..." and it's not the text preprocessor of CMIX (as far as I understood).
However, I don't know how to exactly achieve 16924569, I didn't try it.
Maybe -T doesn't hurt compression if the file size is small, but can it hurt on big files?
I don't think the -T option would hurt compression on big files while leaving smaller ones unaffected - I've tested it on D.TGA, which has 721'185 bytes, and all scores were identical.
In my opinion it could be down to build differences - similar to cmix, where almost every build gives slightly different compression for different files.
To improve the compression ratio you could try "-batch_size" with lower values = 8, 6, 4 or even 2, which give respectively 1.2%, 1.4%, 1.8% and about 2.2% (not yet finished) of gain on my whole testset. Of course with a corresponding decrease in speed. I don't know what the enwik8/9 improvement would be, however some of my files (U.DOC, Q.WK3) gained even more than 10% over the default version and also 4-10% over the LSTMC LARGE option.
Sorry I just realized I did not report the correct parameters for this particular result. The enwik9 result should be OK though. In order to get a matching result, you need to slightly increase the hidden size:
nncp -n_layer 5 -hidden_size 384 -n_symb 16388 -full_connect 1 -lr 6e-3 c out.bin out-nncp.bin
I confirm that the "-T" option does not change the compressor output.
It is possible to get slightly better results with a larger model such as:
nncp -n_layer 5 -hidden_size 512 -n_symb 16388 -full_connect 1 -lr 6e-3 c out.bin out-nncp.bin
-> 16,791,077 bytes
1. Btw, DRT - the origin of cmix -s - (or more like its dictionary) has a special property.
Words in dictionary are ordered in such a way that similar words (morphologically or semantically)
end up also having same prefix and/or suffix bits.
That is, for words that occur in the same context it makes matches several bits longer.
Does nncp preprocess have a similar effect?
2. There are some scripts for enwik preprocessing: https://encode.ru/threads/2590-Some-...ll=1#post59399
They can remove some layers of markup in enwik - maybe that can help with dictionary generation?
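To illustrate point 1 with a toy example (a purely hypothetical four-word list, not the real DRT dictionary): if similar words sit at adjacent dictionary positions, their fixed-width indices share a long bit prefix, so a context that predicts one of them has "almost" predicted its variants too:

```python
# Toy illustration of the DRT ordering property described above; the word
# list and the 16-bit fixed-width index encoding are assumptions.
words = ["compress", "compressed", "compression", "zebra"]
index = {w: i for i, w in enumerate(words)}  # adjacent position = similar word

def bits(w, width=16):
    # Dictionary index as a fixed-width bit string
    return format(index[w], f"0{width}b")

def common_prefix_len(a, b):
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

# Morphological neighbours share 15 of 16 prefix bits here,
# while an unrelated word shares only 14 (in this tiny 4-word dictionary).
print(common_prefix_len(bits("compress"), bits("compressed")))  # 15
print(common_prefix_len(bits("compress"), bits("zebra")))       # 14
```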
@fab - I've tried to use the preprocessor with the latest cmix (v17) dictionary, but it looks like something is going wrong...
Am I doing something wrong? Could the cmix dictionary be used with NNCP preprocessor?
If you want to use the CMIX dictionary then you need to use the CMIX preprocessor. The NNCP preprocessor is different because it produces an output with an alphabet of more than 256 symbols (first number on the "preprocess" command line) and because it builds its own dictionary based on the file content (so your command overwrites english.dic instead of using it !).
In my tests, the NNCP preprocessor gave significantly better results than the CMIX preprocessor with the LSTM or Transformer models. It comes from the fact that each word (or subword) is stored as a single symbol in the NNCP case. In the CMIX case, careful tuning of the word-list ordering and of the byte encoding is needed to achieve good performance. The downside of the NNCP preprocessor is that the compressor must accept large alphabets in order to use it efficiently (hence a byte-based compressor won't give good results).
nncp preprocessor is used like this:
preprocess c _tmp_word.txt enwik8 out.bin 16384 64
where "out.bin" is the result of enwik8 preprocessing (with 2 bytes per symbol when alphabet size is 16384),
and generated dictionary is written to _tmp_word.txt.
16384 here is the target dictionary size... but the final dictionary size can somehow end up different;
in particular for enwik8 it ends up as 16387 (that's why -n_symb 16388 in the nncp options).
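As a side note on the "2 bytes per symbol" output format, here is a minimal sketch of what such a packing could look like (the little-endian order is an assumption; the authoritative layout is whatever preprocess.c actually writes):

```python
import struct

def pack_symbols(symbols, alphabet_size=16388):
    # Any alphabet of up to 65536 symbols fits in 16 bits per token,
    # hence the preprocessed file being 2 bytes per symbol.
    assert alphabet_size <= 1 << 16
    return b"".join(struct.pack("<H", s) for s in symbols)

packed = pack_symbols([0, 259, 16387])
assert len(packed) == 6  # 3 symbols -> 6 bytes
```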
As to algorithm, I think it iteratively finds a string with most occurrences and assigns it to a new symbol.
Most likely you can apply DRT / cmix -s first, then nncp preprocess.
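The "iteratively assign the most frequent string to a new symbol" idea sounds close to byte-pair encoding. Here is a toy Python sketch under that assumption (pair-based and much cruder than NNCP's actual preprocess.c, which promotes longer strings and handles words/case/space):

```python
from collections import Counter

def preprocess_toy(data: bytes, n_new_symbols: int):
    """BPE-style toy: repeatedly replace the most frequent adjacent
    symbol pair with a fresh symbol. Only an approximation of the
    NNCP preprocessor, for illustration."""
    seq = list(data)          # initial symbols are the byte values 0..255
    dictionary = {}           # new symbol -> (left, right) pair it stands for
    next_sym = 256
    for _ in range(n_new_symbols):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break             # not worth spending a dictionary slot
        dictionary[next_sym] = (a, b)
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                out.append(next_sym)   # substitute the pair
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
        next_sym += 1
    return seq, dictionary

seq, dic = preprocess_toy(b"abababab", 4)
# "ab" becomes symbol 256, then "abab" (i.e. 256,256) becomes symbol 257
assert seq == [257, 257]
```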
Ok, I've typed the same command as you and there is no preprocessing output file...
I've tried it on the enwik6 and enwik8 files and the same message appears: "/tmp/word1.txt: No such file or directory".
The "_tmp_word_6.txt" file is generated - properly, I think - but the preprocessed file is missing - I've also tried different names.
Example of command:
D:\Core\0000WORK.111\!00001compression\TESTSCORES\NNCP>preprocess.exe c _tmp_word_6.txt Enw/enwik6 enwik6.nnc 16384 64
input: 1000000 bytes
after case/space preprocessing: 1107308 symbols
822 461688 445950
Number of words=822 Final length=461688
/tmp/word1.txt: No such file or directory