
Thread: How to efficiently compress ordered numbers in text form?

  1. #1
    Member
    Join Date
    Jun 2019
    Location
    Poland
    Posts
    24
    Thanks
    0
    Thanked 0 Times in 0 Posts

    How to efficiently compress ordered numbers in text form?

    We have lines:
    0000
    0001
    0002
    ......
    0102
    0103
    ......
    9998
    9999

    This and similar files are surprisingly well compressed by nanozip and cmix; cmix is more general and compresses very well even with noise: character replacements, or random characters added to lines.
    I think we could use "channels" with "noise":
    first channel: 0000.000... compresses very well, and \n\n\n\n\n... compresses very well;
    every 5th byte is a cycle (0123456789)*;
    the next is noise (0..01..12..2...9..9)*.
    But how can this be algorithmized?
    Would channels with noise also be good for ordinary text files?
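    [The "channels" idea above can be sketched as a transposition of the fixed-width records into per-column streams. This is one reading of the post, not an existing tool; channel-splitting preprocessors such as mm do something similar for fixed-length binary records.]

    ```python
    def split_channels(data: bytes, width: int = 5):
        """Transpose fixed-width records (4 digits + '\n') into columns."""
        records = [data[i:i + width] for i in range(0, len(data), width)]
        return [bytes(rec[c] for rec in records) for c in range(width)]

    # The example file from the post: "0000\n0001\n...\n9999\n"
    data = b"".join(b"%04d\n" % n for n in range(10000))
    channels = split_channels(data)
    # channel 0: b'0'*1000 + b'1'*1000 + ...  (long runs, compresses well)
    # channel 3: "0123456789" repeated        (the short cycle)
    # channel 4: b'\n' * 10000                (trivial)
    ```

    Each channel is a far more regular stream than the interleaved original, so a generic compressor no longer has to model the 5-byte record structure itself.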

  2. #2
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    There're 5 related topics here:

    1) Dictionary (sorted wordlist) compression
    2) Transposition of fixed-length structures (there're preprocessors which do that)
    3) Delta preprocessing (we could convert numbers to binary, then it'll work)
    4) Detection of errors
    5) Why paq/cmix compression is so good?

    What do you actually want to discuss?

    http://freearc.dreamhosters.com/srep393a.zip
    http://freearc.dreamhosters.com/mm11.zip
    http://freearc.dreamhosters.com/delta151.zip
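    [Point 3 in the list (convert the numbers to binary, then delta) can be sketched like this. This is a toy illustration of the idea only, not how the linked delta151 tool works:]

    ```python
    def delta_encode(text: bytes) -> bytes:
        """Parse 4-digit decimal lines, emit successive differences
        as little-endian signed 2-byte integers."""
        nums = [int(line) for line in text.splitlines()]
        diffs = [nums[0]] + [b - a for a, b in zip(nums, nums[1:])]
        return b"".join(d.to_bytes(2, "little", signed=True) for d in diffs)

    data = b"".join(b"%04d\n" % n for n in range(10000))
    out = delta_encode(data)
    # Every difference is 1, so the output is one long run of the
    # byte pair b'\x01\x00' that any LZ/entropy coder shrinks to almost nothing.
    ```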

  3. #3
    Member
    Join Date
    Jun 2019
    Location
    Poland
    Posts
    24
    Thanks
    0
    Thanked 0 Times in 0 Posts
    How do I use Delta? I ran Delta on a 60 kB "0000\n0001...\n9999\n" file, but the output is the same file except for the header.
    Srep works excellently: it detects not only short cycles but even repeated text in book1x20. What is Srep's algorithm?
    5) would be an interesting discussion for a separate thread.


  4. #4
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    delta:
    Code:
    49,995 1 // perl -e "for( 1..9999 ) { printf('%04i'.chr(10),$_); }" >1
    50,027 2 // delta 1 2 
    50,047 3 // delta 2 3
    50,055 4 // delta 3 4
       455 1.rar             
       591 2.rar             
       342 3.rar             
       344 4.rar             
     9,953 1.zip             
     1,560 2.zip             
       638 3.zip             
       644 4.zip
    dedup: http://mattmahoney.net/dc/dce.html#Section_527
    there's source code in the srep archive, but it's hard to explain since it includes multiple different dedup algorithms.

  5. #5
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    Delta was developed for binary data, so it seems it failed to find enough order in this textual data

    with mm, you can try 5*1 as the channel definition

    srep was developed over 10 years with a focus on speed, so by now its code is hardly comprehensible even to me. I should write articles describing its algorithms sometime, but if you don't mind lower speed, it can be explained quickly. E.g. the default algo may be implemented this way:

    go through the data, computing the hash of each N-byte block (default N=512), i.e. hash(b[0]..b[N-1]), hash(b[N]..b[2*N-1]) and so on, and insert each (i, hash(i)) pair into a hashtable. At the same time, compute the hash of the next N bytes at every position, i.e. hash(b[1]..b[N]), hash(b[2]..b[N+1]), ..., each time checking whether the computed hash is already in the hashtable. This way you will find any repetition of length N bytes or more.
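    [A toy version of that description, assuming a plain dict in place of srep's hashtable and Python's built-in hash() in place of a real rolling hash; real srep also extends matches, resolves collisions, and streams the data:]

    ```python
    N = 16  # block size for the toy example (srep's default is 512)

    def find_repeats(b: bytes, n: int = N):
        """Yield (src, dst) pairs where b[dst:dst+n] repeats b[src:src+n]."""
        table = {}                       # block hash -> earliest start offset
        for i in range(0, len(b) - n + 1):
            h = hash(b[i:i + n])         # stand-in for an O(1) rolling hash
            if i % n == 0:               # anchor blocks go into the table
                table.setdefault(h, i)
            j = table.get(h)
            # verify the match to guard against hash collisions
            if j is not None and j < i and b[j:j + n] == b[i:i + n]:
                yield j, i

    data = b"abcdefgh" * 4 + b"XY" + b"abcdefgh" * 4
    matches = list(find_repeats(data))
    ```

    Because only every n-th block is inserted but every position is probed, any repetition of length n or more must overlap some anchored block and gets found.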

  6. #6
    Member
    Join Date
    Jun 2019
    Location
    Poland
    Posts
    24
    Thanks
    0
    Thanked 0 Times in 0 Posts
    I tried:

    Code:
    perl -e "for( 1..9999 ) { printf('%04i'.chr(10),$_); }" >perl1
    syntax error at -e line 1, near ",for"
    Regexp modifiers "/a" and "/d" are mutually exclusive at -e line 1, at end of line
    Unknown regexp modifier "/r" at -e line 1, at end of line
    Unknown regexp modifier "/z" at -e line 1, at end of line
    Unknown regexp modifier "/e" at -e line 1, at end of line
    Unknown regexp modifier "/j" at -e line 1, at end of line
    syntax error at -e line 1, near "})"
    Execution of -e aborted due to compilation errors.
    It creates "0000\n0000\n0000\n....".

  7. #7
    Member
    Join Date
    Jun 2019
    Location
    Poland
    Posts
    24
    Thanks
    0
    Thanked 0 Times in 0 Posts
    A 60 kB file of 10,000 numbers is too short for Delta, but an 8 MB file of a million numbers does get changed: after every 5-10 kilobyte block, a block of information is inserted which could be exploited if we knew the Delta format (?).
    Also interesting is Srep's detection of short cycles (like "abcaad"), whereas efficient detection of an ordinary second repetition requires no cycles.

  8. #8
    Member
    Join Date
    Jun 2018
    Location
    Slovakia
    Posts
    80
    Thanks
    22
    Thanked 3 Times in 3 Posts
    Assuming you have 10,000 numbers in the range 0000-9999, each increased by one, you don't need to compress anything: the file can be produced by any simple incremental number generator. Thus, the "compression" algorithm is trivial: generate the 10,000 incremental numbers one by one via a simple formula. That's the best advice I can give you. I'm pretty sure this is the most compact form.

    CompressMaster
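    [The point above as code, a trivial sketch: when a file is fully determined by a rule, the "decompressor" is just the generator.]

    ```python
    def generate() -> bytes:
        """Regenerate the whole 50,000-byte file from a one-line rule."""
        return b"".join(b"%04d\n" % n for n in range(10000))

    data = generate()  # identical to the original file; nothing is stored
    ```

    In other words, the file's shortest description is a few dozen bytes of program rather than any compressed copy of its contents.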

