
Thread: lstm-compress

  1. #1
    Member
    Join Date
    Sep 2015
    Location
    Italy
    Posts
    216
    Thanks
    97
    Thanked 128 Times in 92 Posts

    lstm-compress

    lstm-compress v2 2017/12/08 https://github.com/byronknoll/lstm-compress Byron Knoll
    From README.md:
    Data compression using LSTM.
    This project uses the same LSTM and preprocessing code as cmix (https://github.com/byronknoll/cmix).
    All of the other cmix models are removed, so the compression is performed using only LSTM.

    Code:
    Without     With        Silesia
    preproc.    preproc.    
     2,453,385   2,206,001  dickens
    13,077,727  13,072,025  mozilla
     1,964,125   1,964,125  mr
     1,519,511   1,540,465  nci
     2,057,222   1,872,869  ooffice
     2,199,701   2,199,701  osdb
       999,853   1,010,654  reymont
     4,304,758   4,080,359  samba
     3,947,236   3,947,236  sao
     6,387,286   5,945,903  webster
     3,572,818   3,572,818  x-ray
       476,967     428,096  xml
    42,960,589  41,840,252  Total
    
    Without     With        Maximum Compression
    preproc.    preproc.    
       824,846     824,846  A10.jpg
     1,300,386   1,235,289  AcroRd32.exe
       664,268     664,268  english.dic
     3,700,212   3,702,662  FlashMX.pdf
       558,044     535,198  FP.LOG
     1,596,914   1,562,334  MSO97.DLL
       882,943     851,990  ohs.doc
       696,018     672,844  rafale.bmp
       647,683     650,306  vcfiu.hlp
       618,766     495,703  world95.txt
    11,490,080  11,195,440  Total
    
    ENWIK8      ENWIK9       LTCB
    20,488,816  175,708,405  v1 http://mattmahoney.net/dc/text.html#1758
    20,494,577  174,868,709  v2 http://mattmahoney.net/dc/text.html#1192
    lstm-compress.exe: g++ (x86_64-win32-sjlj-rev1, Built by MinGW-W64 project) 7.1.0.
    Edit: Rebuilt as a static exe without external dependencies.

    To have fun with LSTM, here is the "LSTM Rock Paper Scissors" game: http://www.byronknoll.com/lstm.html.
    Attached Files
    Last edited by Mauro Vezzosi; 25th December 2017 at 22:23. Reason: Added "Byron Knoll", Redone exe

  2. The Following 4 Users Say Thank You to Mauro Vezzosi For This Useful Post:

    Bulat Ziganshin (25th December 2017),byronknoll (30th December 2017),Darek (27th December 2017),encode (26th December 2017)

  3. #2
    Member
    Join Date
    Dec 2008
    Location
    Poland, Warsaw
    Posts
    909
    Thanks
    531
    Thanked 359 Times in 267 Posts
    Here are the scores for my testset and a comparison to lstm-compress v1. There is a huge gain for the A.TIF and B.TGA files.
    15'279'047 is a very good score for my testset - it's better than 7zip, which has some models in use!
    Attached Thumbnails: lstm_v2.jpg

  4. #3
    Member
    Join Date
    Sep 2015
    Location
    Italy
    Posts
    216
    Thanks
    97
    Thanked 128 Times in 92 Posts
    I did these tests in January 2018.
    I changed one parameter at a time in lstm-compress to see what happens, but I've probably been too aggressive and the results may be of little use.
    I used the Maximum Compression corpus because it is smaller than Silesia.
    Code:
    Standard lstm-compress internal parameters:
             3 layers
            90 nodes
            10 horizon (epochs)
          0.05 learning rate
    -0.2..+0.2 initial random weights range
      -2..  +2 gradients clip range
    
    "Std" means standard version, I repeated it in each table.
    
    Layer
             2      Std 3          4
       824.893    824.846    825.274 A10.jpg
     1.270.148  1.235.289  1.220.603 AcroRd32.exe
       698.073    664.268    665.193 english.dic
     3.705.551  3.702.662  3.699.935 FlashMX.pdf
       564.499    535.198    526.252 FP.LOG
     1.595.717  1.562.334  1.555.029 MSO97.DLL
       865.862    851.990    844.326 ohs.doc
       673.973    672.844    674.563 rafale.bmp
       678.698    650.306    636.665 vcfiu.hlp
       512.007    495.703    493.879 world95.txt
    11.389.421 11.195.440 11.141.719 Total
     2.195.558  2.171.508  2.162.758 ENWIK7
    
    Node
            45     Std 90        180
       824.818    824.846    825.954 A10.jpg
     1.299.285  1.235.289  1.189.902 AcroRd32.exe
       725.576    664.268    615.259 english.dic
     3.711.536  3.702.662  3.697.882 FlashMX.pdf
       603.050    535.198    504.157 FP.LOG
     1.628.691  1.562.334  1.537.240 MSO97.DLL
       898.390    851.990    828.610 ohs.doc
       677.693    672.844    674.042 rafale.bmp
       710.921    650.306    611.658 vcfiu.hlp
       532.676    495.703    477.217 world95.txt
    11.612.636 11.195.440 10.961.921 Total
     2.229.069  2.171.508  2.136.291 ENWIK7
    
    Horizon
             5     Std 10         20
       825.571    824.846    823.873 A10.jpg
     1.248.682  1.235.289  1.233.002 AcroRd32.exe
       685.174    664.268    623.567 english.dic
     3.707.083  3.702.662  3.700.346 FlashMX.pdf
       557.706    535.198    511.467 FP.LOG
     1.582.134  1.562.334  1.562.945 MSO97.DLL
       865.784    851.990    846.908 ohs.doc
       678.718    672.844    672.827 rafale.bmp
       669.939    650.306    646.661 vcfiu.hlp
       509.696    495.703    488.068 world95.txt
    11.330.487 11.195.440 11.109.664 Total
     2.192.842  2.171.508  2.160.483 ENWIK7
    
    Input gate learning rate
         0.025  Std 0.050      0.100
       825.243    824.846    824.490 A10.jpg
     1.240.088  1.235.289  1.228.870 AcroRd32.exe
       673.534    664.268    658.442 english.dic
     3.702.629  3.702.662  3.699.382 FlashMX.pdf
       539.431    535.198    538.460 FP.LOG
     1.572.458  1.562.334  1.558.738 MSO97.DLL
       853.688    851.990    847.368 ohs.doc
       674.458    672.844    672.761 rafale.bmp
       653.185    650.306    647.911 vcfiu.hlp
       499.044    495.703    493.831 world95.txt
    11.233.758 11.195.440 11.170.253 Total
     2.175.044  2.171.508  2.163.518 ENWIK7
    
    Forget gate learning rate
         0.025  Std 0.050      0.100
       825.412    824.846    824.376 A10.jpg
     1.241.715  1.235.289  1.232.403 AcroRd32.exe
       665.632    664.268    673.062 english.dic
     3.703.836  3.702.662  3.699.669 FlashMX.pdf
       537.641    535.198    531.376 FP.LOG
     1.570.718  1.562.334  1.560.168 MSO97.DLL
       857.868    851.990    849.022 ohs.doc
       674.245    672.844    671.883 rafale.bmp
       652.317    650.306    647.192 vcfiu.hlp
       499.919    495.703    494.786 world95.txt
    11.229.303 11.195.440 11.183.937 Total
     2.176.039  2.171.508  2.164.743 ENWIK7
    
    Output gate learning rate
         0.025  Std 0.050      0.100
       825.244    824.846    824.632 A10.jpg
     1.243.503  1.235.289  1.232.550 AcroRd32.exe
       669.327    664.268    668.970 english.dic
     3.702.970  3.702.662  3.699.719 FlashMX.pdf
       544.854    535.198    536.074 FP.LOG
     1.574.049  1.562.334  1.562.473 MSO97.DLL
       853.768    851.990    849.704 ohs.doc
       673.029    672.844    673.253 rafale.bmp
       649.142    650.306    654.354 vcfiu.hlp
       500.296    495.703    493.965 world95.txt
    11.236.182 11.195.440 11.195.694 Total
     2.173.522  2.171.508  2.165.215 ENWIK7
    
    Input node learning rate
         0.025  Std 0.050      0.100
       825.854    824.846    823.939 A10.jpg
     1.237.322  1.235.289  1.237.178 AcroRd32.exe
       663.542    664.268    677.624 english.dic
     3.703.944  3.702.662  3.699.921 FlashMX.pdf
       526.854    535.198    540.722 FP.LOG
     1.568.899  1.562.334  1.565.553 MSO97.DLL
       848.859    851.990    857.762 ohs.doc
       676.944    672.844    672.145 rafale.bmp
       642.909    650.306    661.331 vcfiu.hlp
       500.004    495.703    497.724 world95.txt
    11.195.131 11.195.440 11.233.899 Total
     2.175.106  2.171.508  2.171.284 ENWIK7
    
    Output layer learning rate
         0.025  Std 0.050      0.100
       825.180    824.846    826.603 A10.jpg
     1.254.039  1.235.289  1.234.911 AcroRd32.exe
       679.201    664.268    684.964 english.dic
     3.701.299  3.702.662  3.709.542 FlashMX.pdf
       553.844    535.198    531.468 FP.LOG
     1.585.219  1.562.334  1.562.857 MSO97.DLL
       863.229    851.990    849.971 ohs.doc
       671.989    672.844    681.788 rafale.bmp
       672.872    650.306    643.127 vcfiu.hlp
       506.386    495.703    497.077 world95.txt
    11.313.258 11.195.440 11.222.308 Total
     2.182.033  2.171.508  2.177.054 ENWIK7
    
    All learning rate
         0.025  Std 0.050      0.100
       827.311    824.846    824.692 A10.jpg
     1.268.166  1.235.289  1.218.508 AcroRd32.exe
       656.171    664.268    672.956 english.dic
     3.710.504  3.702.662  3.700.902 FlashMX.pdf
       556.194    535.198    532.121 FP.LOG
     1.604.756  1.562.334  1.545.841 MSO97.DLL
       875.327    851.990    846.725 ohs.doc
       677.561    672.844    678.543 rafale.bmp
       668.864    650.306    647.836 vcfiu.hlp
       518.758    495.703    490.964 world95.txt
    11.363.612 11.195.440 11.159.088 Total
     2.203.135  2.171.508  2.158.946 ENWIK7
    
    All learning rate, 3 l.r. in groups of 30 nodes
     Std 0.050 0.040,0.050,0.060 0.030,0.050,0.080 0.030..0.080 0.020..0.100
       824.846           825.001           824.522      826.314      827.466 A10.jpg
     1.235.289         1.240.437         1.239.921    1.246.916    1.266.856 AcroRd32.exe
       664.268           668.597           658.140      654.910      663.110 english.dic
     3.702.662         3.700.661         3.702.224    3.708.093    3.713.739 FlashMX.pdf
       535.198           535.151           539.187      538.921      553.054 FP.LOG
     1.562.334         1.565.003         1.569.173    1.583.380    1.600.967 MSO97.DLL
       851.990           852.253           851.856      857.540      868.689 ohs.doc
       672.844           674.525           674.738      678.601      683.470 rafale.bmp
       650.306           648.363           650.100      653.923      664.000 vcfiu.hlp
       495.703           500.461           500.531      505.881      515.531 world95.txt
    11.195.440        11.210.452        11.210.392   11.254.479   11.356.882 Total
     2.171.508         2.173.030         2.175.378    2.188.972    2.205.889 ENWIK7
    
    Initial random range
     -0.1..0.1  Std -0.2..0.2  -0.4..0.4
       826.292    824.846    825.429 A10.jpg
     1.236.044  1.235.289  1.247.287 AcroRd32.exe
       674.524    664.268    682.571 english.dic
     3.702.590  3.702.662  3.701.957 FlashMX.pdf
       539.339    535.198    536.722 FP.LOG
     1.571.286  1.562.334  1.575.201 MSO97.DLL
       852.690    851.990    856.244 ohs.doc
       676.598    672.844    677.707 rafale.bmp
       648.386    650.306    653.344 vcfiu.hlp
       502.824    495.703    496.945 world95.txt
    11.230.573 11.195.440 11.253.407 Total
     2.172.169  2.171.508  2.179.717 ENWIK7
    
    Initial random range, 3 random ranges in groups of 30 nodes
    Std +/-0.2 +/-0.1,+/-0.2,+/-0.4
       824.846              825.167 A10.jpg
     1.235.289            1.243.915 AcroRd32.exe
       664.268              673.557 english.dic
     3.702.662            3.701.351 FlashMX.pdf
       535.198              542.438 FP.LOG
     1.562.334            1.572.495 MSO97.DLL
       851.990              850.608 ohs.doc
       672.844              674.595 rafale.bmp
       650.306              655.345 vcfiu.hlp
       495.703              501.277 world95.txt
    11.195.440           11.240.748 Total
     2.171.508            2.175.067 ENWIK7
    
    Clip gradients
         -1..1  Std -2..2      -4..4
       824.868    824.846    824.839 A10.jpg
     1.243.347  1.235.289  1.236.228 AcroRd32.exe
       670.369    664.268    673.663 english.dic
     3.701.987  3.702.662  3.702.697 FlashMX.pdf
       534.919    535.198    536.580 FP.LOG
     1.566.821  1.562.334  1.566.371 MSO97.DLL
       850.196    851.990    851.627 ohs.doc
       675.415    672.844    672.843 rafale.bmp
       650.971    650.306    651.628 vcfiu.hlp
       497.501    495.703    495.976 world95.txt
    11.216.394 11.195.440 11.212.452 Total
     2.172.235  2.171.508  2.171.777 ENWIK7

  5. The Following User Says Thank You to Mauro Vezzosi For This Useful Post:

    byronknoll (6th June 2018)

  6. #4
    Member
    Join Date
    Sep 2015
    Location
    Italy
    Posts
    216
    Thanks
    97
    Thanked 128 Times in 92 Posts
    Lstm-compress version 2018/05/20.
    Code:
    Without     With        Silesia
    preproc.    preproc.    
     2,452,210   2,212,281  dickens
    13,078,520  13,065,534  mozilla
     1,965,130   1,965,130  mr
     1,513,956   1,538,407  nci
     2,060,339   1,872,557  ooffice
     2,207,734   2,207,734  osdb
       993,409     993,374  reymont
     4,302,549   4,083,301  samba
     3,931,549   3,931,549  sao
     6,426,132   5,941,866  webster
     3,568,298   3,568,298  x-ray
       476,299     426,904  xml
    42,976,125  41,806,935  Total
    
    Without     With        Maximum Compression
    preproc.    preproc.    
       825,133     825,095  A10.jpg
     1,297,199   1,232,225  AcroRd32.exe
       668,131     668,131  english.dic
     3,700,283   3,701,982  FlashMX.pdf
       557,589     535,240  FP.LOG
     1,593,520   1,563,185  MSO97.DLL
       882,069     849,228  ohs.doc
       695,062     675,336  rafale.bmp
       645,482     651,544  vcfiu.hlp
       618,918     496,032  world95.txt
    11,483,386  11,195,440  Total
    
    Without     With       LTCB
    preproc.    preproc.    
            11          11 ENWIK0
            20          20 ENWIK1
            97          97 ENWIK2
           655         488 ENWIK3
         5,699       4,269 ENWIK4
        41,610      29,958 ENWIK5
       301,495     233,294 ENWIK6
     2,556,526   2,168,519 ENWIK7
    23,148,738  20,477,200 ENWIK8
    Could the attention gate (e.g. https://openreview.net/pdf?id=rJVruWZRW) improve lstm-compress/cmix?

    lstm-compress.exe: g++ (x86_64-win32-sjlj-rev1, Built by MinGW-W64 project) 7.1.0.
    Attached Files

  7. The Following User Says Thank You to Mauro Vezzosi For This Useful Post:

    byronknoll (6th June 2018)

  8. #5
    Member
    Join Date
    Mar 2011
    Location
    USA
    Posts
    223
    Thanks
    106
    Thanked 102 Times in 63 Posts
    Quote Originally Posted by Mauro Vezzosi View Post
    Could the attention gate (e.g. https://openreview.net/pdf?id=rJVruWZRW) improve lstm-compress/cmix?

    Thanks for the link - I will add attention gate to my list of ideas to experiment with.


    Recently I got some nice gains by using the Adam optimizer in cmix's LSTM: https://github.com/byronknoll/cmix/c...0956b1ff62e59f


    Adam will probably lead to compression gains for lstm-compress as well (although it might need some tuning first).

  9. #6
    Member
    Join Date
    Sep 2015
    Location
    Italy
    Posts
    216
    Thanks
    97
    Thanked 128 Times in 92 Posts
    I added Adam to lstm-compress and did some tests.
    Sometimes the simplified Adam (1) is quite a bit better than the current full version, which lacks the iteration counter "t" parameter (2) (A10.jpg, world95.txt, ENWIK1..6), but sometimes it's much worse (english.dic, FP.LOG, vcfiu.hlp, ENWIK7).
    (1) (*g) = (*m) / (sqrt(*v) + eps);
    (2) (*g) = ((*m) / (1 - beta1)) / (sqrt((*v) / (1 - beta2)) + eps);
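    For reference, here is a small self-contained sketch of one Adam step for a single weight (not the actual cmix/lstm-compress code; w, g, m, v, t and the default hyper-parameters are only illustrative), showing the textbook bias-corrected update next to the two variants (1) and (2) above:
    Code:
    #include <cmath>
    
    // One Adam step for a single weight w with gradient g.  m and v are the running
    // first/second moment estimates, t is the step counter (t >= 1).  The defaults
    // are typical Adam values, not necessarily what lstm-compress uses.
    void adam_step(float &w, float g, float &m, float &v, int t,
                   float alpha = 0.001f, float beta1 = 0.9f,
                   float beta2 = 0.999f, float eps = 1e-8f) {
      m = beta1 * m + (1 - beta1) * g;      // first-moment estimate
      v = beta2 * v + (1 - beta2) * g * g;  // second-moment estimate
      // Textbook Adam: bias correction uses pow(beta, t).
      float m_hat = m / (1 - std::pow(beta1, t));
      float v_hat = v / (1 - std::pow(beta2, t));
      w -= alpha * m_hat / (std::sqrt(v_hat) + eps);
      // Variant (2) above drops t:               m / (1 - beta1), v / (1 - beta2)
      // Variant (1) above drops the correction:  m / (sqrt(v) + eps)
    }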
    Code:
      Standard    Full Adam  Simple Adam  Maximum Compression (Without preprocessing)
       825,133      815,923      809,546  A10.jpg
     1,297,199    1,303,090    1,300,199  AcroRd32.exe
       668,131      666,712      763,544  english.dic
     3,700,283    3,687,604    3,689,230  FlashMX.pdf
       557,589      543,722      616,684  FP.LOG
     1,593,520    1,604,636    1,598,449  MSO97.DLL
       882,069      846,662      848,038  ohs.doc
       695,062      688,294      687,478  rafale.bmp
       645,482      621,777      635,215  vcfiu.hlp
       618,918      604,283      590,570  world95.txt
    11,483,386   11,382,703   11,538,953  Total
    
      Standard    Full Adam  Simple Adam  LTCB (Without preprocessing)
            11           11           11  ENWIK0
            20           20           20  ENWIK1
            97           97           97  ENWIK2
           655          627          603  ENWIK3
         5,699        5,114        4,771  ENWIK4
        41,610       39,792       36,713  ENWIK5
       301,495      300,665      284,276  ENWIK6
     2,556,526    2,537,854    2,584,992  ENWIK7
    Do you know DeepZip/NN_compression? https://github.com/kedartatwawadi/NN_compression
    Maybe there are some useful ideas to add to your list.

  10. The Following User Says Thank You to Mauro Vezzosi For This Useful Post:

    byronknoll (8th June 2018)

  11. #7
    Member
    Join Date
    Mar 2011
    Location
    USA
    Posts
    223
    Thanks
    106
    Thanked 102 Times in 63 Posts
    Quote Originally Posted by Mauro Vezzosi View Post
    Do you know DeepZip/NN_compression? https://github.com/kedartatwawadi/NN_compression
    Maybe there are some useful ideas to add to your list.
    Yeah. Since I live in the bay area, I actually met up with Kedar in person to talk about his work.

  12. #8
    Member
    Join Date
    Sep 2015
    Location
    Italy
    Posts
    216
    Thanks
    97
    Thanked 128 Times in 92 Posts
    Some more tests.
    I don't know if I implemented Nadam correctly; it slightly improves compression compared to Adam (see the last column), except for FP.LOG and some ENWIK files (full) and FlashMX.pdf (simplified).
    Full Nadam:
    (*g) = ((*m) / (1 - beta1) * beta1 + ((1 - beta1) * (*g)) / (1 - beta1)) / (sqrt((*v) / (1 - beta2)) + eps);
    Simplified Nadam:
    (*g) = ((*m) * beta1 + ((1 - beta1) * (*g))) / (sqrt(*v) + eps);

    Code:
      Standard    Full Adam    Full Adam    Full Adam    Full Adam   Full Nadam  Maximum Compression (Without preprocessing)
                   Standard  alpha 0.001  alpha 0.005     eps 1e-5     Standard
       825,133      815,923      821,236      809,624      816,173      815,861  A10.jpg
     1,297,199    1,303,090    1,349,198    1,290,425    1,305,676    1,301,153  AcroRd32.exe
       668,131      666,712      671,467      759,506      668,373      664,049  english.dic
     3,700,283    3,687,604    3,701,449    3,687,262    3,688,235    3,687,476  FlashMX.pdf
       557,589      543,722      549,866      590,271      542,995      544,453  FP.LOG
     1,593,520    1,604,636    1,650,936    1,591,849    1,606,619    1,603,351  MSO97.DLL
       882,069      846,662      879,364      841,102      844,852      846,548  ohs.doc
       695,062      688,294      702,894      683,766      687,957      685,976  rafale.bmp
       645,482      621,777      658,085      620,540      623,822      620,081  vcfiu.hlp
       618,918      604,283      654,013      582,851      603,721      603,055  world95.txt
    11,483,386   11,382,703   11,638,508   11,457,196   11,388,423   11,372,003  Total
    
      Standard  Simple Adam  Simple Adam  Simple Adam  Simple Adam Simple Nadam  Maximum Compression (Without preprocessing)
                   Standard  alpha 0.001  alpha 0.005     eps 1e-5     Standard
       825,133      809,546      812,516      813,849      809,540      809,390  A10.jpg
     1,297,199    1,300,199    1,290,529    1,365,068    1,298,681    1,292,535  AcroRd32.exe
       668,131      763,544      683,900      775,346      764,538      753,257  english.dic
     3,700,283    3,689,230    3,685,323    3,715,349    3,689,058    3,690,600  FlashMX.pdf
       557,589      616,684      554,711      754,591      617,587      616,366  FP.LOG
     1,593,520    1,598,449    1,593,124    1,659,388    1,593,734    1,592,039  MSO97.DLL
       882,069      848,038      839,446      900,941      850,375      847,765  ohs.doc
       695,062      687,478      683,659      713,696      687,616      686,527  rafale.bmp
       645,482      635,215      615,591      695,915      632,115      630,606  vcfiu.hlp
       618,918      590,570      584,763      660,872      591,138      588,596  world95.txt
    11,483,386   11,538,953   11,343,562   12,055,015   11,534,382   11,507,681  Total
    
      Standard    Full Adam    Full Adam    Full Adam    Full Adam   Full Nadam  LTCB (Without preprocessing)
                   Standard  alpha 0.001  alpha 0.005     eps 1e-5     Standard  
            11           11           11           11           11           11  ENWIK0
            20           20           20           20           20           20  ENWIK1
            97           97           97           97           98           97  ENWIK2
           655          627          632          615          628          631  ENWIK3
         5,699        5,114        5,454        4,799        5,108        5,061  ENWIK4
        41,610       39,792       42,886       37,337       39,709       39,613  ENWIK5
       301,495      300,665      323,103      284,807      301,259      300,470  ENWIK6
     2,556,526    2,537,854    2,630,145    2,561,338    2,540,938    2,539,363  ENWIK7
    
    
      Standard  Simple Adam  Simple Adam  Simple Adam  Simple Adam Simple Nadam  LTCB (Without preprocessing)
                  Standard   alpha 0.001  alpha 0.005     eps 1e-5     Standard
            11           11           11           11           11           11  ENWIK0
            20           20           20           20           20           20  ENWIK1
            97           97           98           98           97           97  ENWIK2
           655          603          626          573          602          597  ENWIK3
         5,699        4,771        4,929        4,678        4,756        4,692  ENWIK4
        41,610       36,713       38,317       36,547       36,626       36,630  ENWIK5
       301,495      284,276      290,016      293,842      284,496      282,647  ENWIK6
     2,556,526    2,584,992    2,529,729    2,803,015    2,574,447    2,583,714  ENWIK7

  13. The Following User Says Thank You to Mauro Vezzosi For This Useful Post:

    byronknoll (15th June 2018)

  14. #9
    Member
    Join Date
    Mar 2011
    Location
    USA
    Posts
    223
    Thanks
    106
    Thanked 102 Times in 63 Posts
    Thanks, I have switched over to using full Nadam. Here are some cmix benchmarks:

    150 LSTM cells:
    full adam (enwik8): 15057573
    full nadam (enwik8): 15063294
    simple adam (enwik8): 15136202
    simple nadam (enwik8): 15131470

    200 LSTM cells:
    full adam (enwik8): 15024811
    full nadam (enwik8): 15026674
    full adam (english.dic): 86663
    full nadam (english.dic): 86656
    full adam (enwik6): 178936
    full nadam (enwik6): 178858
    full adam (calgary.tar): 544258
    full nadam (calgary.tar): 544091

    I recently added another LSTM improvement to couple the input and forget gates: https://github.com/byronknoll/cmix/c...a93a1e6f793dfd
    Described here: http://colah.github.io/posts/2015-08...tanding-LSTMs/
    This improves CPU/memory usage and slightly improves the compression rate.

  15. The Following User Says Thank You to byronknoll For This Useful Post:

    Mauro Vezzosi (15th June 2018)

  16. #10
    Member
    Join Date
    Sep 2015
    Location
    Italy
    Posts
    216
    Thanks
    97
    Thanked 128 Times in 92 Posts
    I copied lstm.* and lstm-layer.* from cmix to lstm-compress and I did some tests.

    Code:
    My previous tests (2018/06/13 22:30)  | New tests
      Standard    Full Adam   Full Nadam  |    Standard   Full Nadam   Full Nadam   Full Nadam   Full Nadam   Full Nadam   Full Nadam   Full Nadam   Full Nadam   Maximum Compression (Without preprocessing)
                   Standard     Standard  |(as in cmix)  alpha 0.001  alpha 0.003  beta1 0.899  beta1 0.901  beta2 0.998 beta2 0.9999  beta1 0.899 beta1 0.901
                                          |                                                                                            beta2 0.998 beta2 0.9999
       825,133      815,923      815,861  |     815,710      819,988      812,517      815,777      815,639      814,074      819,935      814,126      819,881   A10.jpg
     1,297,199    1,303,090    1,301,153  |   1,295,702    1,338,486    1,282,695    1,295,434    1,294,972    1,288,432    1,356,389    1,289,730    1,355,481   AcroRd32.exe
       668,131      666,712      664,049  |     637,068      642,555      652,699      639,454      643,574      654,906      656,424      651,514      654,928   english.dic
     3,700,283    3,687,604    3,687,476  |   3,686,966    3,700,578    3,683,024    3,687,145    3,686,807    3,688,519    3,700,505    3,688,331    3,700,207   FlashMX.pdf
       557,589      543,722      544,453  |     545,113      545,272      555,538      542,082      544,985      563,433      558,220      561,117      558,503   FP.LOG
     1,593,520    1,604,636    1,603,351  |   1,602,946    1,642,796    1,588,674    1,604,050    1,603,408    1,594,552    1,660,287    1,593,978    1,659,800   MSO97.DLL
       882,069      846,662      846,548  |     847,539      879,142      836,859      846,001      847,050      842,393      893,848      841,226      894,405   ohs.doc
       695,062      688,294      685,976  |     681,159      693,529      679,742      681,136      681,137      680,040      693,412      680,141      693,416   rafale.bmp
       645,482      621,777      620,081  |     618,863      651,706      614,315      617,780      618,162      614,171      673,845      616,656      673,146   vcfiu.hlp
       618,918      604,283      603,055  |     609,602      659,701      594,806      609,971      608,845      598,693      673,224      598,810      672,114   world95.txt
    11,483,386   11,382,703   11,372,003  |  11,340,668   11,573,753   11,300,869   11,338,830   11,344,579   11,339,213   11,686,089   11,335,629   11,681,881   Total
                                          |
      Standard    Full Adam   Full Nadam  |    Standard   Full Nadam   Full Nadam   Full Nadam   Full Nadam   Full Nadam   Full Nadam   Full Nadam   Full Nadam   LTCB (Without preprocessing)
                   Standard     Standard  |(as in cmix)  alpha 0.001  alpha 0.003  beta1 0.899  beta1 0.901  beta2 0.998 beta2 0.9999  beta1 0.899 beta1 0.901
                                          |                                                                                            beta2 0.998 beta2 0.9999
            20           20           20  |          20           20           20           20           20           20           20           20           20   ENWIK1
            97           97           97  |         100          102          100          100          100          100          100          100          100   ENWIK2
           655          627          631  |         607          635          590          607          607          607          607          607          607   ENWIK3
         5,699        5,114        5,061  |       4,909        5,114        4,756        4,911        4,906        4,900        4,917        4,902        4,915   ENWIK4
        41,610       39,792       39,613  |      38,991       41,749       37,980       39,022       38,961       38,485       39,767       38,513       39,732   ENWIK5
       301,495      300,665      300,470  |     297,307      319,132      287,697      297,836      296,977      290,792      319,326      291,160      319,113   ENWIK6
     2,556,526    2,537,854    2,539,363  |   2,550,984    2,635,106    2,537,835    2,550,397    2,549,403    2,542,454    2,686,419    2,542,352    2,685,499   ENWIK7
                                          |
      Standard    Full Adam   Full Nadam  |    Standard   Full Nadam   Full Nadam   Full Nadam   Full Nadam   Full Nadam   Full Nadam   Full Nadam   Full Nadam   ... (Without preprocessing)
                   Standard     Standard  |(as in cmix)  alpha 0.001  alpha 0.003  beta1 0.899  beta1 0.901  beta2 0.998 beta2 0.9999  beta1 0.899 beta1 0.901
                                          |                                                                                            beta2 0.998 beta2 0.9999
                                          |     835,371      879,547      822,474      836,156      835,217      824,464      898,413      826,286      897,445   Calgary_corpus.tar
                                          |     499,589      516,891      493,021      495,703      499,333      490,702      514,097      490,397      513,593   Canterbury_corpus.tar
                                          |     417,534      417,634      417,591      417,535      417,534      417,570      417,590      417,571      417,589   pi1000000.txt ("3." + 1000000 decimals)
                                          |     100,271      100,269      100,277      100,271      100,271      100,274      100,269      100,274      100,269   sharnd_challenge.dat (100000 random bytes)
                                          |     346,080      354,743      342,826      346,170      345,985      344,128      352,861      344,199      352,732   test-seed000-n100000.uiq2 (651227 UIQ2 bytes) (Generic Compression Benchmark, http://mattmahoney.net/dc/uiq/)
                                          |          46           46           46           46           46           46           46           46           46   Test_000 (0x00 * 256 KiB)
                                          |          48           49           48           48           48           48           48           48           48   Test_255 (0xff * 256 KiB)
    Then I did some changes "here and there" to Nadam to check what happens (I don't know if they are already known).
    Please note: I'm not a great mathematician, "I don't know what I did", and I can't prove that the new learning updates converge to a local minimum or are stable.
    Nevertheless, some of these changes seem to improve some files considerably (mainly english.dic and FP.LOG, plus Canterbury_corpus.tar) that are hard to improve without hurting most of the other files, and are only slightly better or worse on the rest.
    Vnadam 1
    (*v) += (1 - beta2) * (*g) * (*g);
    (*t) -= alpha * (((*m) / (1 - beta1) * beta1 + ((1 - beta1) * (*g)) / (1 - beta1)) / (sqrt((*v) / (1 - beta2) * beta2 + ((1 - beta2) * (*g) * (*g)) / (1 - beta2)) + eps));
    Vnadam 2
    (*v) += (1 - beta2) * (*g) * (*g);
    (*t) -= alpha * (((*m) / (1 - beta1) * beta1 + ((1 - beta1) * (*g)) / (1 - beta1)) / (sqrt((*v) / (1 - beta2) * beta2) + eps));
    Vnadam 3
    (*v) += (1 - beta2) * (*g) * (*g);
    (*t) -= alpha * (((*m) / (1 - beta1) * beta1 + ((1 - beta1) * (*g)) / (1 - beta1)) / (sqrt((*v) / (1 - beta2) + ((1 - beta2) * (*g) * (*g)) / (1 - beta2)) + eps));
    For comparison, current Nadam is
    (*v) += (1 - beta2) * (*g) * (*g);
    (*t) -= alpha * (((*m) / (1 - beta1) * beta1 + ((1 - beta1) * (*g)) / (1 - beta1)) / (sqrt((*v) / (1 - beta2)) + eps));
    Code:
      Standard     Vnadam 1     Vnadam 2     Vnadam 3   Maximum Compression (Without preprocessing)
       815,710      815,579      815,707      815,583   A10.jpg
     1,295,702    1,296,016    1,295,876    1,295,558   AcroRd32.exe
       637,068      632,409      636,995      635,191   english.dic
     3,686,966    3,686,899    3,686,720    3,686,762   FlashMX.pdf
       545,113      539,017      541,740      532,961   FP.LOG
     1,602,946    1,601,811    1,603,847    1,602,013   MSO97.DLL
       847,539      846,546      846,312      848,462   ohs.doc
       681,159      682,383      681,142      682,379   rafale.bmp
       618,863      619,389      617,912      618,707   vcfiu.hlp
       609,602      609,293      609,624      609,427   world95.txt
    11,340,668   11,329,342   11,335,875   11,327,043   Total
    
      Standard     Vnadam 1     Vnadam 2     Vnadam 3   LTCB (Without preprocessing)
            20           20           20           20   ENWIK1
           100          101          100          101   ENWIK2
           607          607          607          607   ENWIK3
         4,909        4,902        4,909        4,902   ENWIK4
        38,991       38,988       38,989       38,991   ENWIK5
       297,307      297,627      297,236      297,652   ENWIK6
     2,550,984    2,549,420    2,550,070    2,550,763   ENWIK7
    
      Standard     Vnadam 1     Vnadam 2     Vnadam 3   ... (Without preprocessing)
       835,371      835,525      835,620      835,056   Calgary_corpus.tar
       499,589      489,246      491,975      490,562   Canterbury_corpus.tar
       417,534      417,546      417,534      417,546   pi1000000.txt ("3." + 1000000 pi decimals)
       100,271      100,271      100,271      100,272   sharnd_challenge.dat (100000 random bytes)
       346,080      346,265      346,074      346,270   test-seed000-n100000.uiq2 (651227 UIQ2 bytes) (Generic Compression Benchmark, http://mattmahoney.net/dc/uiq/)
            46           46           46           46   Test_000 (0x00 * 256 KiB)
            48           48           48           48   Test_255 (0xff * 256 KiB)
    Then I saw that Nadam and all my new learning updates can be simplified:
    ((1 - beta1) * (*g)) / (1 - beta1) ~= *g
    This is because I stripped the "t" parameter, as in the original cmix formula.
    ((1 - beta1) * (*g)) / (1 - beta1) should be, if I'm not mistaken, ((1 - beta1) * (*g)) / pow(1 - beta1, t)
    Replacing ((1 - beta1) * (*g)) / (1 - beta1) with *g gives the same results on most files, but sometimes there is a noticeable difference (Canterbury_corpus.tar, FP.LOG).
    Code:
        Standard      Nadam
    (as in cmix)    only *g
         835,371    835,365 Calgary_corpus.tar
         499,589    501,841 Canterbury_corpus.tar
         637,068    636,621 english.dic
             607        607 ENWIK3
           4,909      4,909 ENWIK4
          38,991     38,991 ENWIK5
         545,113    542,377 FP.LOG
         417,534    417,534 pi1000000.txt
         100,271    100,271 sharnd_challenge.dat
         346,080    346,080 test-seed000-n100000.uiq2

  17. The Following User Says Thank You to Mauro Vezzosi For This Useful Post:

    byronknoll (2nd July 2018)

  18. #11
    Member
    Join Date
    Oct 2010
    Location
    Germany
    Posts
    275
    Thanks
    6
    Thanked 23 Times in 16 Posts
    Hello Byron,

    Thanks for your efforts regarding the LSTM compressor.
    I'm currently experimenting with neural networks for lossless audio compression and have some questions regarding network training.

    In a recurrent network there are 4 types of training when performing time series prediction, and I have a hard time figuring out which one you are using.

    Type1 (one-to-one): feed the network the past samples via a window = "tapped delay line", where the past samples are just input features of the network.
    As you update the network weights after each sample (stateful BPTT with a batch size of 1, or classic backpropagation), the context (= hidden state in the RNN) is correctly propagated forward in time.

    Type2 (one-to-one): you do not use a window; you feed the network sample by sample and let the recurrence do its magic (and find temporal patterns).

    Type3 (one-to-one): you feed in a batch of samples and use BPTT epoch times to update the weights; after one batch the context state is reset.

    Type4 (many-to-one): similar to Type2, but you reset the context state after each sample.
    To build the context state for a new sample you feed the network the past samples without forwarding the output layer.
    At the last time step (within one sample) you forward the output layer and update the weights.

    In comparison with the other types, you work with "fresh weights" when building the context state.

    In my experiments the differences are small, but you can tune Type3 with many epochs to improve prediction (at the cost of a massive time increase).
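    To make the distinction concrete, here is a minimal sketch of the loop structure I mean for Type2 versus Type3 (Net, StepAndLearn, BufferSample, TrainBPTT and ResetState are just placeholders, not any particular implementation):
    Code:
    #include <vector>
    
    struct Net {
      void ResetState() {}                         // clear the hidden/context state
      float StepAndLearn(float x) { return 0.0f; } // forward one sample and update the weights
      void BufferSample(float x) {}                // collect samples for a BPTT batch
      void TrainBPTT(int epochs) {}                // backprop through the buffered batch
    };
    
    // Type2: sample-by-sample, stateful; weights are updated after every sample.
    void TrainType2(Net &net, const std::vector<float> &data) {
      for (float x : data) net.StepAndLearn(x);
    }
    
    // Type3: batched; BPTT is run 'epochs' times per batch, then the state is reset.
    void TrainType3(Net &net, const std::vector<float> &data,
                    int batch = 10, int epochs = 10) {
      int n = 0;
      for (float x : data) {
        net.BufferSample(x);
        if (++n % batch == 0) {
          net.TrainBPTT(epochs);
          net.ResetState();
        }
      }
    }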

    Some numbers on 16 test files using a classic RNN, the window method, one hidden layer (10 nodes, LeakyReLU activation) and Adam optimization:
    Code:
    Type 1:
    28.299.306 Bytes
    
    Type 3: batchsize 10, epochs per batch: 10
    28.311.422 Bytes
    Increasing the number of hidden layers only works if I add the possibility of forward propagation of the identity function (residual nets or highway nets).
    Code:
    1 hidden "feed-forward only" layer with 4-nodes: y=f(w.x+b)
    28.345.735 Bytes
    
    10 hidden "feed-forward only" layers with 4-nodes: y=f(w.x+b)
    28.518.367 Bytes
    
    10 hidden "feed-forward only" layers with 4-nodes using residual modelling: y=f(w.x+b)+x
    28.328.903 Bytes
    With regards
    Sebastian
    Last edited by Sebastian; 2nd July 2018 at 14:11.

  19. #12
    Member
    Join Date
    Mar 2011
    Location
    USA
    Posts
    223
    Thanks
    106
    Thanked 102 Times in 63 Posts
    @Sebastian, perhaps I don't fully understand the four types that you described, but I don't think I am using any of those four types.

    If I understand correctly, type1 and type2 only use a horizon size (number of time steps for truncated BPTT) of 1. cmix currently uses a horizon size of 40.

    For type3 and type4, I don't understand why the context state is reset. cmix never resets the context state. It seems like doing so is not desirable, since it would just lose information.

    What type would you consider the network used here? https://arxiv.org/pdf/1308.0850.pdf . This uses BPTT with horizon size of 100, and the context state is reset every 10,000 characters (again, I am not sure why the context state is reset). cmix uses the same layer architecture (figure 1) as described in the paper. However, it uses a different learning strategy.

    I posted a related question to stackexchange: https://datascience.stackexchange.co...n-through-time
    The network in the Alex Graves paper uses technique#1, and cmix uses technique#2. With technique#2, there is one input+output per time step and the weights are only updated every N time steps (N=40 for cmix).
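    Roughly, technique#2 looks like the following skeleton (the names are placeholders, not the actual cmix/lstm-compress code): one forward pass and one prediction per time step, gradients accumulated over the horizon window, and the weights only updated every N steps while the context state keeps rolling forward.
    Code:
    #include <cstdint>
    #include <vector>
    
    struct LstmState { /* hidden and cell activations would live here */ };
    
    // Placeholder: run the network one step and return a probability for the coder.
    float ForwardStep(LstmState &state, std::uint8_t input) { return 0.5f; }
    // Placeholder: accumulate gradients over the buffered window (truncated BPTT).
    void BackwardThroughTime(const std::vector<std::uint8_t> &window) {}
    // Placeholder: apply the accumulated gradients (e.g. with Adam/Nadam).
    void ApplyUpdate() {}
    
    void CompressStream(const std::vector<std::uint8_t> &data, int horizon = 40) {
      LstmState state;                   // never reset: the context rolls forward
      std::vector<std::uint8_t> window;  // the last 'horizon' symbols for truncated BPTT
      for (std::uint8_t symbol : data) {
        float p = ForwardStep(state, symbol);  // one input+output per time step
        (void)p;                               // (would be fed to the arithmetic coder)
        window.push_back(symbol);
        if ((int)window.size() == horizon) {   // weights change only every N steps
          BackwardThroughTime(window);
          ApplyUpdate();
          window.clear();
        }
      }
    }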

  20. #13
    Member
    Join Date
    Oct 2010
    Location
    Germany
    Posts
    275
    Thanks
    6
    Thanked 23 Times in 16 Posts
    Somehow I misunderstood the word "horizon". To me it is something in the future, but it seems this is just your "batch size" = looking into the past.
    I update the weights at every sample, not just every 40. The context state keeps rolling forward.

    So the big question now is: decreasing the horizon size should increase compression?! If not, I did not understand what you're doing.

    technique#1 is "type4"... it is the most accurate one for gradient estimation, and you perfectly show a many-to-many model in your stackexchange question.
    In my opinion it's only really useful when you keep shuffling samples during training.

    It avoids the following problem, which imo occurs with your approach:
    use a set of weights W1 and some samples S, with context state H0 at the beginning and H1 at the end of the batch.
    After you do BPTT on the (in your case 40) samples you have a new set of weights W2, but for the first sample in the new batch you use context state H1 (which was the result of weights W1).
    It is slightly better to reset the context state and feed the network the time steps of the current sample.

    I think there is a common confusion regarding "timesteps".
    In the literature a timestep is not the next sample in a coding sequence but a step within one sample.
    For "abc": "a", "b" and "c" are samples at timestep t, and you code the following sequences
    __|a
    _a|b
    ab|c

    The batch size would still be 1! The sequences "__", "_a" and "ab" are used to build the context state for "a", "b" and "c".
    That's why I don't understand the use of BPTT at all if you do "online prediction" and don't have training data to build a static model.

    Understanding Stateful LSTM Recurrent Neural Networks

    Maybe you understand it better than me. Especially the paragraph "Stateful LSTM for a One-Char to One-Char Mapping".

    Here is a test for what I think you're doing:

    encode samples in batches->update weights with BPTT->encode next batch of samples
    Code:
    batch-size  1:1.673.928 (context reset after each batch)
    batch-size  1:1.652.156 (12 seconds)
    batch-size  5:1.672.581 (6 seconds)
    batch-size 10:1.690.307 (6 seconds)
    batch-size 10:1.649.446 (20 seconds, 10 epochs)

  21. #14
    Member
    Join Date
    Mar 2011
    Location
    USA
    Posts
    223
    Thanks
    106
    Thanked 102 Times in 63 Posts
    Quote Originally Posted by Sebastian View Post
    So the big question now is: decreasing the horizon size should increase compression?! If not, I did not understand what you're doing.
    No. For technique#2, changing the horizon size has no impact on the running time. So that parameter has been tuned purely to maximize compression. A horizon size of 1 doesn't work well because without the BPTT the network would not be good at learning long-range dependencies. Setting the horizon size to some huge number also doesn't work well because the weights would get updated less frequently.

  22. #15
    Member
    Join Date
    Oct 2010
    Location
    Germany
    Posts
    275
    Thanks
    6
    Thanked 23 Times in 16 Posts
    But if "the weights would get updated less frequently" running time should decrease, especially with something like ADAM.
    I also don't understand why a horizon size of 1 should not work well, because you keep the context-state and optimize by the current sample (stochastic gradient descent).
    But maybe you can't compare my tests to character level encoding.

    Regarding Type1 and 2. in a feed-forward network you feed the past as inputs, so for "abc", when encoding "c": "b" and "a" are input at the input-layer. In reccurrent networks you feed sequences.

  23. #16
    Member
    Join Date
    Mar 2011
    Location
    USA
    Posts
    223
    Thanks
    106
    Thanked 102 Times in 63 Posts
    Quote Originally Posted by Sebastian View Post
    But if "the weights would get updated less frequently" running time should decrease, especially with something like ADAM.
    Ah, you are right. However, the cost of doing the forward/backward passes is much higher than the weight update step. Looking back over some of my previous benchmarks, I don't see a significant time difference using different horizon sizes. I haven't tried tuning the horizon size since switching to Adam - I guess since the weight update step is more expensive now, it might make sense for me to run some benchmarks re-tuning it.

    Quote Originally Posted by Sebastian View Post
    I also don't understand why a horizon size of 1 should not work well, because you keep the context state and optimize on the current sample (stochastic gradient descent).
    But maybe you can't compare my tests to character-level encoding.
    Backpropagating through multiple time steps definitely helps - that is why BPTT is used for recurrent networks. It is true that a network can learn patterns longer than the horizon size. For example, a character-level LSTM model can learn to close brackets even when the distance between them is longer than the horizon size. However, backpropagating through multiple time steps makes it easier for it to learn these types of long-range dependencies. LSTMs were developed to help deal with the vanishing gradient problem when running BPTT through multiple time steps.

    BTW, would it be easy to run lstm-compress on your lossless audio data? It would be great to get some performance comparisons of lstm-compress with other RNN implementations.

  24. #17
    Member
    Join Date
    Oct 2010
    Location
    Germany
    Posts
    275
    Thanks
    6
    Thanked 23 Times in 16 Posts
    Currently I use neural networks with L2 regression on the output (for testing). I have code for stacking layers of MLP, RNN, MLP+Residual and GRU.
    I don't know if it is better to directly encode the network output via softmax and an arithmetic coder. Currently the residuals from prediction are fed into a complicated PAQ-like model.

    Feeding the raw audio samples to the network like you do in lstm-compress is an extremely bad idea, because audio samples have strong autoregressive features (which NNs are bad at modelling).
    There is practically no difference between RNN and MLP (with GRU being no better than RNN; the gates somehow approach 0.5 really fast).

    That being said, I achieve the best performance when I use the NN to model the residual of a linear autoregressive predictor. (NN=33mb, LPC=28.5mb, LPC+NN=28.2mb)
    There was already someone with that idea before: A new hybrid methodology for nonlinear time series forecasting
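    As a rough sketch of what I mean (PredictResidual is just a stand-in for the network, not my actual code): the fixed-order linear predictor gives a first estimate and the NN only has to model what the linear part missed.
    Code:
    #include <cstddef>
    #include <vector>
    
    // Stand-in for the neural network's prediction of the LPC residual.
    float PredictResidual(const std::vector<float> &residual_history) { return 0.0f; }
    
    // Linear autoregressive part y = a_1*x_(t-1) + a_2*x_(t-2) + ... plus a
    // learned correction from the residual model on top of it.
    float HybridPredict(const std::vector<float> &samples,
                        const std::vector<float> &lpc_coeffs,
                        const std::vector<float> &residual_history) {
      float lpc = 0.0f;
      for (std::size_t i = 0; i < lpc_coeffs.size() && i < samples.size(); ++i)
        lpc += lpc_coeffs[i] * samples[samples.size() - 1 - i];
      return lpc + PredictResidual(residual_history);
    }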

    I will take a look at convolutional nets.

  25. #18
    Member
    Join Date
    Oct 2010
    Location
    Germany
    Posts
    275
    Thanks
    6
    Thanked 23 Times in 16 Posts
    Quote Originally Posted by Sebastian View Post
    because audio samples have strong autoregressive features (which NNs are bad at modelling).
    Let me give you an example for comparison on one sample file: female_speech.wav (3.384.860 Bytes):
    In all variants I use an 8/4 configuration (8 samples from channel 0 and 4 from channel 1).

    Variant 1: linear autoregression y=a_1*x_(t-1)+a_2*x_(t-2)+..., the system is solved every k=32 samples through Cholesky decomposition (the covariance matrix is updated for every sample)
    Variant 2: feed-forward net with a single layer, one neuron and linear activation: y=a_1*x_(t-1)+a_2*x_(t-2)+..., obviously the same system but optimized with SGD-Adam
    Variant 3: we add a single hidden layer with 12 neurons and rectified linear activation units; the system is now completely non-linear
    Variant 4: the hidden layer can model the identity function (MLP+Residual)

    Code:
    Variant 1:   984.778 Bytes (3 seconds)
    Variant 2: 1.227.168 Bytes (4 seconds)
    Variant 3: 1.316.996 Bytes (8 seconds)
    Variant 4: 1.305.595 Bytes (8 seconds)
    As you can see, better performance can only be reached with better learning methods and clever skipping of connections.
    SGD variants are imo mainly useful in batch mode, where you iterate numerous times over the training data.

  26. #19
    Member
    Join Date
    Mar 2011
    Location
    USA
    Posts
    223
    Thanks
    106
    Thanked 102 Times in 63 Posts
    The wavenet architecture might work well for audio compression: https://arxiv.org/pdf/1609.03499.pdf

  27. #20
    Member
    Join Date
    Oct 2010
    Location
    Germany
    Posts
    275
    Thanks
    6
    Thanked 23 Times in 16 Posts
    Thanks for the link. Maybe you should try multiple passes of Adam over the horizon window; this definitely helps compression.

  28. #21
    Member
    Join Date
    Oct 2010
    Location
    Germany
    Posts
    275
    Thanks
    6
    Thanked 23 Times in 16 Posts
    I wanted to test the effect of the two different learning techniques you posted in your stackexchange question - maybe this is helpful for you.

    Test on lossless audio coding (atrain.wav: 3.377.110 Bytes):
    8/4 linear autoregressive prediction solved every k=32 samples via Cholesky decomposition
    the residual of the linear predictor is fed into a simple recurrent network: 1 input node, 4 hidden nodes (leaky ReLU), 1 output node (linear)
    L2 cost function

    Variant 1 (technique #2 from your question)
    The forward state of the network is collected in a window of size n for every sample and solved via BPTT.
    "ResetState" means that we start with an empty state for every new window.
    "RollState" means that the network is kept stateful: the last state from the previous window is used for the first sample of the new window in the forward and gradient calculations.

    Weights are only updated every n samples.

    Results:
    Code:
    BPTT 1, ResetState: 1.631.740
    BPTT 4, ResetState: 1.630.019
    BPTT 8, ResetState: 1.628.317
    BPTT 16,ResetState: 1.629.298
    
    BPTT 1, RollState:  1.625.147
    BPTT 4, RollState:  1.624.289
    BPTT 8, RollState:  1.623.649
    BPTT 16,RollState:  1.626.318
    The results show, for this specific file, that increasing the window size increases compression up to a certain point (audio files are highly non-static).
    Keeping the network stateful also helps for these small window sizes.


    Variant 2 (technique #1 from your question)
    We do sequence modelling. For this we feed past samples of the current time point to the network.
    For the (n-1) past samples we do not forward the hidden layer to the output layer.
    The delta at the output layer is thus only calculated for the last of the n samples!

    That means in Variant 1 we calculate the delta at the hidden layer as: delta=(next_higher_delta+future_hidden_delta)*s'(output)
    In Variant 2 the gradient for the (n-1) samples flows only from the future, thus: delta=future_hidden_delta*s'(output)

    Keeping the network stateful is tricky. The correct state for the next sample is the output of the first sample of the current sample's history.

    Weights are then updated for every sample, irrespective of the window size.

    Results:
    Code:
    BPTT 4, ResetState: 1.623.696
    BPTT 8, ResetState: 1.614.328
    BPTT 16,ResetState: 1.612.785
    BPTT 32,ResetState: 1.613.161
    
    BPTT 4, RollState:  1.616.056
    BPTT 8, RollState:  1.612.169
    BPTT 16,RollState:  1.612.463
    BPTT 32,RollState:  1.613.567
    Sequence modelling indeed helps and keeping the network stateful also helps for small window sizes.

  29. #22
    Member
    Join Date
    Mar 2011
    Location
    USA
    Posts
    223
    Thanks
    106
    Thanked 102 Times in 63 Posts
    Quote Originally Posted by Sebastian View Post
    I wanted to test the effect of the two different learning techniques you posted in your stackexchange question - maybe this is helpful for you.
    Thanks, it is helpful. The main advantage of technique#2 is the improved computational complexity. For a fixed amount of CPU time, I think technique#2 can get better results. The speed vs compression trade-off can be adjusted with other parameters like:

    1) Changing the number of LSTM cells
    2) Changing the number of layers
    3) Creating an ensemble of N networks, and averaging them together
    Last edited by byronknoll; 6th July 2018 at 00:53.

  30. #23
    Member
    Join Date
    Mar 2011
    Location
    USA
    Posts
    223
    Thanks
    106
    Thanked 102 Times in 63 Posts
    Quote Originally Posted by Mauro Vezzosi View Post
    Then I did some changes "here and there" to Nadam to check what happens (I don't know if they are already known).
    Thanks, I have switched to using a simplified version of Vnadam 1: https://github.com/byronknoll/cmix/c...1c2cbaa89253d4

  31. The Following User Says Thank You to byronknoll For This Useful Post:

    Mauro Vezzosi (17th July 2018)

  32. #24
    Member
    Join Date
    Oct 2010
    Location
    Germany
    Posts
    275
    Thanks
    6
    Thanked 23 Times in 16 Posts
    I'm wondering why you do the bias correction via 1/(1-beta), when it should be 1/(1-beta^t).
    There is a clear deterioration in performance when you don't do the correction right.

  33. #25
    Member
    Join Date
    Mar 2011
    Location
    USA
    Posts
    223
    Thanks
    106
    Thanked 102 Times in 63 Posts
    Quote Originally Posted by Sebastian View Post
    I'm wondering why you do the bias correction via 1/(1-beta), when it should be 1/(1-beta^t).
    There is a clear deterioration in performance when you don't do the correction right.
    Whoops, that was a bug in my Adam implementation. Here is the fix: https://github.com/byronknoll/cmix/c...a0b307f58a0dd9
    Thanks for pointing this out. From my benchmarking so far, this is significantly better.

  34. #26
    Member
    Join Date
    Sep 2015
    Location
    Italy
    Posts
    216
    Thanks
    97
    Thanked 128 Times in 92 Posts
    Maybe I need to explain better what I mean, because I already wrote about "t":
    Quote Originally Posted by Mauro Vezzosi View Post
    Sometimes the simplified Adam (1) is quite a bit better than the current full version, which lacks the iteration counter "t" parameter (2)
    Quote Originally Posted by Mauro Vezzosi View Post
    This is because I stripped the "t" parameter, as in the original cmix formula.
    You update the LSTM code faster than I can test it; I need to restart my tests again.

  35. #27
    Member
    Join Date
    Mar 2011
    Location
    USA
    Posts
    223
    Thanks
    106
    Thanked 102 Times in 63 Posts
    Quote Originally Posted by Mauro Vezzosi View Post
    Maybe I need to explain better what I mean, because I already wrote about "t":
    Yeah, I should have realized my mistake when you first mentioned the "t" parameter. For some reason I didn't notice the bug until I saw Sebastian's post. Sorry about that.

  36. #28
    Member
    Join Date
    Mar 2011
    Location
    USA
    Posts
    223
    Thanks
    106
    Thanked 102 Times in 63 Posts
    I think this should be the weight update equation for Nadam:

      (*w) -= alpha * ((((beta1 * (*m)) / (1 - pow(beta1, t + 1))) +
                        (((1 - beta1) * (*g)) / (1 - pow(beta1, t)))) /
                       (sqrt((beta2 * (*v)) / (1 - pow(beta2, t)) + eps)));

    Based on this paper: https://openreview.net/pdf?id=OM0jvwB8jIp57ZJjtNEZ
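    Written out as a full step it would look something like this (only a sketch: how m, v and t are maintained is my assumption about the surrounding code, and the weight update line is the formula above verbatim):
    Code:
    #include <cmath>
    
    // m and v are the running first/second moment estimates, t is the step counter (t >= 1).
    void nadam_step(float &w, float g, float &m, float &v, int t,
                    float alpha, float beta1, float beta2, float eps) {
      m = beta1 * m + (1 - beta1) * g;      // first-moment estimate
      v = beta2 * v + (1 - beta2) * g * g;  // second-moment estimate
      // Weight update exactly as written above (note: eps sits inside the sqrt here).
      w -= alpha * ((((beta1 * m) / (1 - std::pow(beta1, t + 1))) +
                     (((1 - beta1) * g) / (1 - std::pow(beta1, t)))) /
                    (std::sqrt((beta2 * v) / (1 - std::pow(beta2, t)) + eps)));
    }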

    Let me know if something looks wrong.

  37. #29
    Member
    Join Date
    Sep 2015
    Location
    Italy
    Posts
    216
    Thanks
    97
    Thanked 128 Times in 92 Posts
    Now I'm in a hurry; I'll go deeper later.
    My previous Nadam was buggy; now I tested this
    (*w) -= alpha * (((((*m) / (1 - float(pow(beta1, t)))) * beta1) + ((1 - beta1) * (*g)) / (1 - float(pow(beta1, t)))) / (sqrt((*v) / (1 - float(pow(beta2, t)))) + eps));
    which seems the same as yours, except that I use t instead of t + 1.
    It's based on http://ruder.io/optimizing-gradient-descent/
    Adam seems good on most files, but terrible on english.dic and FP.LOG (and a bit worse on ENWIK7).
    Nadam seems a little bit better, but still bad on english.dic and FP.LOG.

    Edit 1: There are differences: my version has "(*v)", yours has "beta2 * (*v)"; mine has "sqrt(x) + eps", yours "sqrt(x + eps)" (I already tested the two eps versions; I'll post the results later).
    Edit 2: My previous Nadam was not buggy; only the "t" parameter was missing compared to my current implementation (which may still be faulty).
    Last edited by Mauro Vezzosi; 20th July 2018 at 23:00. Reason: Found 2 differences

  38. #30
    Member
    Join Date
    Sep 2015
    Location
    Italy
    Posts
    216
    Thanks
    97
    Thanked 128 Times in 92 Posts
    The weight decay seems of little use in lstm-compress.
    Your Nadam is better than mine.
    Test "Nadam Byron" is still running.
    Code:
        Before Weight decay Fix decay bug Adam bugfix Nadam Byron Nadam Byron Nadam Mauro Nadam Mauro
        change                                        sqrt(v)+eps sqrt(v+eps) sqrt(v)+eps sqrt(v+eps)
                 2018/07/10   2018/07/15   2018/07/18  2018/07/20  2018/07/20  2018/07/20  2018/07/20
                                                                                                       Maximum Compression (Without preprocessing)
       815,710      815,711      815,710      809,867     809,604     809,681     809,265              A10.jpg
     1,295,702    1,293,714    1,295,629    1,290,465   1,286,699   1,288,380   1,288,396              AcroRd32.exe
       637,068      638,235      636,570      755,510     737,045     726,025     740,460     749,884  english.dic
     3,686,966    3,687,226    3,686,977    3,693,194   3,693,342   3,688,052   3,692,158              FlashMX.pdf
       545,113      541,624      545,785      618,412     616,904     619,737     622,065              FP.LOG
     1,602,946    1,602,488    1,602,834    1,594,534   1,587,399   1,592,081   1,591,074              MSO97.DLL
       847,539      845,466      846,010      847,012     847,028     845,534     849,175              ohs.doc
       681,159      681,249      681,164      683,730     682,788     682,440     682,185              rafale.bmp
       618,863      618,559      619,010      629,510     624,834     626,757     624,909              vcfiu.hlp
       609,602      609,716      609,673      598,329     600,720     597,535     598,189              world95.txt
    11,340,668   11,333,988   11,339,362   11,520,563  11,486,363  11,476,222  11,497,876              Total
                                                                                                       LTCB (Without preprocessing)
            20           20           20           20          20          20          20              ENWIK1
           100          100           99          103         102         103         101              ENWIK2
           607          607          607          624         616         626         613         624  ENWIK3
         4,909        4,909        4,909        4,834       4,814       4,834       4,824       4,838  ENWIK4
        38,991       38,992       38,991       36,782      36,611      36,741      36,644      36,728  ENWIK5
       297,307      297,314      297,307      281,914     281,285     281,702     281,168              ENWIK6
     2,550,984    2,549,770    2,550,853    2,602,622   2,587,635   2,597,889   2,596,981              ENWIK7
                                                                                                       ... (Without preprocessing)
       835,371      835,302      835,481      819,265     817,030     817,880     817,380     818,087  Calgary_corpus.tar
       499,589      501,655      499,772      489,795     488,510     481,542     488,804     488,866  Canterbury_corpus.tar
       417,534      417,534      417,534      419,586     419,256     418,357     419,143     418,233  pi1000000.txt ("3." + 1000000 pi decimals)
       100,271      100,271      100,271      100,288     100,287     100,288     100,286     100,288  sharnd_challenge.dat (100000 random bytes)
       346,080      346,090      346,080      341,213     341,827     341,447     341,583     341,416  test-seed000-n100000.uiq2 (651227 UIQ2 bytes) (Generic Compression Benchmark, http://mattmahoney.net/dc/uiq/)
            46           46           46           46          46          46          46              Test_000 (0x00 * 256 KiB)
            48           48           48           49          49          49          48              Test_255 (0xff * 256 KiB)
       100,467      100,357      100,359      100,359     100,346     100,331     100,349     100,334  Test_000+Test_255+sharnd_challenge.dat
    Edit 2018/07/21 20:56:
    - updated the column "Nadam Byron" with the missing results; this column is now called "Nadam Byron" "sqrt(v+eps)".
    - added the column "Nadam Byron" "sqrt(v)+eps"; it is the first of the two "Nadam Byron" columns.
    - added the file "Test_000+Test_255+sharnd_challenge.dat", which is the three individually tested files joined together; it is used to test how lstm-compress compresses highly repetitive data followed by other highly repetitive data or by data that is very hard to compress: weight decay really helps in this case.
    Last edited by Mauro Vezzosi; 21st July 2018 at 20:56. Reason: See Edit 2018/07/21 20:56
