
Thread: TurboPFor: Integer Compression

  1. #1
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    456
    Thanks
    46
    Thanked 164 Times in 118 Posts

    TurboPFor: Integer Compression

    + TurboPFor: Integer Compression
    - 100% C (C++ compatible headers), w/o inline assembly
    - Direct Access
    - Integrated (SIMD) delta/Zigzag
    - Full range 16/32/64
    - Access TurboPFor incl. SIMD from Java as fast as calling from C.

    + Features
    - "Variable Byte"

    - "Variable Simple"

    - Scalar "Bit Packing" with Direct/Random Access (zero decompression)
    - "SIMD Bit Packing"

    - "Elias Fano"

    - Novel "TurboPFor" scheme with direct access.
    Outstanding compression and speed.

    - New: Scalar & SIMD Transform: Delta, Zigzag, Transpose/Shuffle, ... (a delta+zigzag sketch follows at the end of this post)

    + Inverted Index ...do less, go fast!

    TurboPFor: Fastest Integer Compression
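
    For reference, a minimal sketch of the delta+zigzag transform idea (a standard technique; illustration only, not TurboPFor's actual code): deltas of a nearly sorted list are small, and zigzag folds negative deltas into small unsigned codes that bit-pack well.

    #include <stdint.h>
    #include <stddef.h>

    /* Zigzag maps 0,-1,1,-2,2,... to 0,1,2,3,4,... so small signed
       deltas become small unsigned values. */
    static inline uint32_t zigzag32(int32_t v)    { return ((uint32_t)v << 1) ^ (uint32_t)(v >> 31); }
    static inline int32_t  unzigzag32(uint32_t v) { return (int32_t)(v >> 1) ^ -(int32_t)(v & 1); }

    /* Delta + zigzag: out[i] is the folded difference of consecutive inputs. */
    static void delta_zigzag(const uint32_t *in, uint32_t *out, size_t n)
    {
        uint32_t prev = 0;
        for (size_t i = 0; i < n; i++) {
            out[i] = zigzag32((int32_t)(in[i] - prev));
            prev = in[i];
        }
    }
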
    Last edited by dnd; 15th April 2016 at 13:28.

  2. The Following User Says Thank You to dnd For This Useful Post:

    just a worm (31st May 2015)

  3. #2
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    456
    Thanks
    46
    Thanked 164 Times in 118 Posts
    Updated: now supports gcc, clang & mingw-w64.

  4. #3
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    456
    Thanks
    46
    Thanked 164 Times in 118 Posts
    New Update:
    + more functions
    + 64 bits integer lists
    + Java Critical Native Interface. Access TurboPFor incl. SIMD from Java as fast as calling from C
    + Benchmark comparing to delta+transpose+lz4 (similar to delta+blosc).


    see: TurboPFor: Fastest Integer Compression
    Last edited by dnd; 2nd February 2016 at 21:35.

  5. #4
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    456
    Thanks
    46
    Thanked 164 Times in 118 Posts
    New Update:
    + Zigzag encoding/decoding
    + Improved speed

    + My GitHub answer to "Comparison with Blosc can be improved", showing applications of "Integer Compression":
    TurboPFor compresses better, compresses ~12 times faster, and decompresses ~13 times faster than blosc w/ zlib
    Last edited by dnd; 24th June 2015 at 00:02.

  6. #5
    Member
    Join Date
    Jan 2015
    Location
    chennai,Tamilnadu,India
    Posts
    19
    Thanks
    2
    Thanked 0 Times in 0 Posts

    Why is it faster than any other technique?

    Hello,

    While searching for the fastest bit packing/unpacking techniques for my application, I came across TurboPFor. After going through the source code, I am left with the following doubts.

    Why is TurboPFor faster than other packing/unpacking techniques? I explored the source code from GitHub but found nothing obviously special that would make it that fast. Even without SIMD, it produced remarkable results. Please explain the algorithm you followed to pack and unpack numbers at such high speed.

    I also found that the number of bits into which the integers are packed is calculated from the input array. Is that where the performance comes from, by selecting the optimal 'b' size?

    Why do you declare the output array as unsigned char* when it is cast to uint64_t* during packing anyway?

    And why do you pack into a uint32_t* array instead of uint64_t* when the 'b' size is 32?

    Thank you.
    Last edited by mitra; 20th July 2015 at 09:45.

  7. #6
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    456
    Thanks
    46
    Thanked 164 Times in 118 Posts
    Quote Originally Posted by mitra View Post
    ...Why is TurboPFor faster than other packing/unpacking techniques?
    Glad you're interested in TurboPFor.
    Well, TurboPFor exploits capabilities available on modern superscalar processors, such as branchless and out-of-order execution.
    Additionally, the dependencies between successive instructions are reduced to a minimum. Memory access/transfer is also reduced by the compression.
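
    To make this concrete, here is a toy sketch (illustration only, not TurboPFor's actual code) of branchless unpacking with a 64-bit register: all four extractions depend only on the single load, so a superscalar CPU can execute the shift/mask chains out of order.

    #include <stdint.h>
    #include <string.h>

    /* Toy branchless unpack of four 7-bit values from one 64-bit word.
       No branches and no cross-value dependencies beyond the one load. */
    static void unpack4x7(const unsigned char *in, uint32_t *out)
    {
        uint64_t w;
        memcpy(&w, in, sizeof w);     /* single 64-bit load */
        out[0] = (uint32_t)( w        & 0x7f);
        out[1] = (uint32_t)((w >> 7)  & 0x7f);
        out[2] = (uint32_t)((w >> 14) & 0x7f);
        out[3] = (uint32_t)((w >> 21) & 0x7f);
    }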

    Why do you declare the output array as unsigned char* when it is cast to uint64_t* during packing anyway?

    All TurboPFor functions have a similar API, and "char *" is used for the compressed buffer. This permits a consistent interface.
    The output buffer is also not necessarily a multiple of "uint64_t" in size. Memory output is done in 64-bit chunks.

    And why do you pack into a uint32_t* array instead of uint64_t* when the 'b' size is 32?

    The bitpacking macros are template-generated, so there is no difference in coding between different b sizes.

    I suggest you read the papers included in the references.

  8. #7
    Member
    Join Date
    Jan 2015
    Location
    chennai,Tamilnadu,India
    Posts
    19
    Thanks
    2
    Thanked 0 Times in 0 Posts
    I have an integer array and need to pack the values into a 'long' array where each long element holds two integers (two 32-bit integers packed into one 64-bit long value, halving the element count), and similarly unpack them when needed. Could you please suggest the fastest way to do this packing/unpacking?
    Here is what I have done.
    packing :

    size_t k = 0, z = 0;
    while (k < pack_count)
    {
        /* cast each half to unsigned before widening to avoid sign extension */
        pack[k] = ((uint64_t)(uint32_t)data[z] << 32) | (uint64_t)(uint32_t)data[z+1];
        k++;
        z += 2;
    }

    where pack_count = (number of integers)/2

    unpacking :

    size_t i = 0, j = 0, x = 0;
    while (i < count)
    {
        temp[j++] = (uint32_t)(pac[x] >> 32);          /* high half */
        temp[j++] = (uint32_t)(pac[x] & 0xffffffffu);  /* low half  */
        i++; x++;
    }

    With this I couldn't achieve the high speed that TurboPFor gets.
    Last edited by mitra; 21st July 2015 at 12:17.

  9. #8
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    456
    Thanks
    46
    Thanked 164 Times in 118 Posts
    Simply use a single call to bitpack/vp4dc with the integer array. You don't have to pack into a 64-bit long first.
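
    For illustration, a minimal sketch of what such a single call could look like. The prototype below is an assumption on my part; the exact signatures live in bitpack.h / bitutil.h and should be checked there. Only the one-call idea is the point.

    /* Assumed TurboPFor-style prototype (verify against bitpack.h):
       packs n values at b bits each, returns a pointer past the output. */
    unsigned char *bitpack32(unsigned *in, unsigned n, unsigned char *out, unsigned b);

    void pack_column(unsigned *data, unsigned n, unsigned char *buf)
    {
        unsigned b;
        BITSIZE32(data, n, b);                            /* bits needed per value (macro from bitutil.h) */
        unsigned char *end = bitpack32(data, n, buf, b);  /* single call packs the whole array */
        /* compressed size = end - buf */
    }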

  10. #9
    Member
    Join Date
    Jan 2015
    Location
    chennai,Tamilnadu,India
    Posts
    19
    Thanks
    2
    Thanked 0 Times in 0 Posts
    Thank you!

    Can we achieve "fixed length encoding" at this high speed?
    If yes, can you suggest how to do it?

  11. #10
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    456
    Thanks
    46
    Thanked 164 Times in 118 Posts
    Look at icbench.c please.

  12. #11
    Member
    Join Date
    Jan 2015
    Location
    chennai,Tamilnadu,India
    Posts
    19
    Thanks
    2
    Thanked 0 Times in 0 Posts
    Hello,

    I am dealing with long values and ran the following command: ./icbench -eturbopack -n100 -M12345678223
    I have a column with 11-digit values (long) and I am trying to pack them using turbopack.
    But I got this error: icbench.c:zipfgen:563:mallo error

    Line 563 in the code is: if(!(zmap = malloc(m*sizeof(zmap[0])))) die("mallo error\n");

    How can I rectify this?
    I don't want to feed data from a file, because in real time I don't get data from a file; I get it from memory.
    I have checked for an option that makes turbopack 64-bit, but I didn't find any such option in what you provided.

    Kindly help me out with packing and unpacking 'long' values using turbopack.

  13. #12
    Member
    Join Date
    Jan 2015
    Location
    chennai,Tamilnadu,India
    Posts
    19
    Thanks
    2
    Thanked 0 Times in 0 Posts
    Hello,

    How should the 'b' value, i.e. the number of bits into which packing should be done, be calculated for long (64-bit) values?
    There is a macro 'BITSIZE32' for 32-bit. Similarly, is there a way to calculate it for 64-bit?

    Thank you.

  14. #13
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    456
    Thanks
    46
    Thanked 164 Times in 118 Posts
    You can simply use BITSIZE as in the macro BITSIZE32 (see bitutil.h):
    #define BITSIZE64(in, n, b) BITSIZE(in, n, b, 64)
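
    For reference, what such a macro computes is the bit width of the largest value in the array. A plain-C equivalent (my illustration of the usual OR-reduction approach, not the actual bitutil.h code):

    #include <stdint.h>
    #include <stddef.h>

    /* OR all values together; the position of the highest set bit of the
       result is the 'b' needed to bit-pack any value in the array. */
    static unsigned bitsize64(const uint64_t *in, size_t n)
    {
        uint64_t x = 0;
        for (size_t i = 0; i < n; i++) x |= in[i];
        return x ? 64u - (unsigned)__builtin_clzll(x) : 0;  /* gcc/clang builtin */
    }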

  15. #14
    Member
    Join Date
    Jan 2015
    Location
    chennai,Tamilnadu,India
    Posts
    19
    Thanks
    2
    Thanked 0 Times in 0 Posts
    (out of curiosity)

    Can you make it an Apache/BSD-licensed package? Or at least put files like bitpack64_.h and bitunpack64_.h under a more permissive license, so that people can write their own (commercial) applications and use the packing/unpacking power of TurboPFor.

  16. #15
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    456
    Thanks
    46
    Thanked 164 Times in 118 Posts
    Sorry, but currently there are no plans to change the license.

  17. #16
    Member
    Join Date
    May 2008
    Location
    Germany
    Posts
    410
    Thanks
    37
    Thanked 60 Times in 37 Posts

    Why not upload it?

    Congratulations - sounds good
    ---
    lzturbo 1.2 Copyright (c) 2007-2014 Hamid Buzidi Aug 11 2014

    Usage: lzturbo <options> <filename1> .. <filenameN> DESTINATION_DIR
    <options>
    -ml m: compression method (1..4), l:compression level (0,1,2,9).
    -p# #: number of processors/cores (default=autodetect). 0=disable multithreading
    -b# #: block size in mb (default 64)
    -r compress/decomp directories recursively
    -d decomp
    -f force overwrite of output file
    -o write on standard output
    Ex.: lzturbo -32 -f file.jpg ab*.txt backup_dir
    lzturbo -49 -r mydir\* backup_dir
    lzturbo -10 file file.lzt
    cat file | ./lzturbo -of -10 >file.lzt
    lzturbo -d -r backup_dir\* restore_dir
    lzturbo -d file.lzt file
    ---
    # Usage:
    ## Compression
    lzturbo -ml [-rfo] [-p#] <filename1> .. <filenameN> DESTINATION_DIR

    ml
    m: compression method (1..4), 1: (fastest decompression) ... 4: (best compression)
    l: compression level (0..2,9), 0: (fastest compression) ... 9: (best but slow compression)
    decompression time is the same for all levels.
    ml=10, 11, 12, 19 : Fast byte output
    20, 21, 22, 29 : Efficient byte output
    30, 31, 32, 39 : Asymmetric Numeral System TurboANX w. JIT switching [scalar/SSE/AVX2]
    ? ? ? 49 : Bitwise Arithmetic Coding (?=40,41,42 not available)

    b#
    #: block size in mb (max. 1024), default: 64/128, 128mb for level 9
    use b1024 for max. compression and large files.

    ## Decompression
    lzturbo -d -p# [-fo] <filename1> .. <filenameN> DESTINATION_DIR

    ## Options for compression/decompression:
    r compress/decompress directories recursively
    p# #: number of processors/cores (default=autodetect). 0=disable multithreading
    f force overwrite of output file
    o write on standard output
    S disable ANS SIMD (method 3)
    DESTINATION_DIR: Destination directory or destination file.

    ## Decompression examples:
    lzturbo -d enwik9.lzt restore_dir
    lzturbo -d *.lzt restore_dir
    lzturbo -d -r backup_dir\* restore_dir
    lzturbo -d file.lzt file

    Replace '\' with '/' on Linux.
    ---
    best regards

  18. #17
    Member
    Join Date
    Nov 2015
    Location
    Mumbai
    Posts
    13
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Hello,

    On TurboPFor's GitHub page there is a table comparing the different algorithms.
    I found there is not much difference between the scalar turbopack and turbopackV, its SIMD version.

    Theoretically there should be a 4X improvement with SIMD, since we process 4 integers at a time.
    But there is only minimal improvement, even running on 100 million values (if the data must be large to amortize the overhead of parallel execution, 100 million should be large enough, I guess).

    Could you please explain why there is so little difference in speed between the scalar and SIMD versions of turbopack?

  19. #18
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    456
    Thanks
    46
    Thanked 164 Times in 118 Posts

    Scalar vs SIMD

    Quote Originally Posted by joey View Post
    ...I found there is not much difference between the scalar turbopack and turbopackV, its SIMD version.
    Theoretically there should be a 4X improvement with SIMD, since we process 4 integers at a time...
    Well, there are several reasons:

    - The scalar version TurboPack is highly optimized, using 64-bit registers and exploiting out-of-order execution.
    Other, slower scalar bitpacking implementations (ex. in FastPFor) use 32-bit registers or direct memory-to-memory bit packing.

    - In my benchmarks, I'm always using practical, large arrays to avoid the in-cache scenario found in micro-benchmarks from other authors.
    In practice you are rarely accessing data that is already in the cache.
    Memory access becomes more and more important, especially in multithreaded applications.
    For example, search engines or databases randomly access several gigabytes to answer queries.
    You can see the effect by using a small array (ex. ./icbench -m0 -M255 -n1m)

    - But in general the SIMD version TurboPackV is slightly faster, and I'm using it in the inverted index app.


    SIMD is also not always faster; it depends on the implementation of the scalar function used in the benchmark (see for example TurboVByte vs. MaskedVByte).
    The scalar "transpose" (shuffle) function in TurboPFor is also faster than the SIMD "shuffle" version in blosc.
    Last edited by dnd; 18th November 2015 at 15:52.

  20. #19
    Member
    Join Date
    Jan 2015
    Location
    chennai,Tamilnadu,India
    Posts
    19
    Thanks
    2
    Thanked 0 Times in 0 Posts
    The explanation was very useful.

  21. #20
    Member
    Join Date
    Nov 2015
    Location
    Mumbai
    Posts
    13
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Basically there are many algorithms for integers, but very few of them scale to long. I am happy that TurboPFor is one of them.

    "What if SIMD is applied to the long data type too? Will it accelerate packing/unpacking?"

    Long is present in almost every table of a typical database. Can you suggest ways to accelerate compression/decompression of the long datatype?

  22. #21
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    456
    Thanks
    46
    Thanked 164 Times in 118 Posts
    Quote Originally Posted by joey View Post
    ""What if SIMD is applied to Long data type too? Will it accelerate the process of pack/unpack?""
    It is possible to accelerate TurboPack64 by using SIMD for bit sizes <= 32, but it is difficult to predict the performance before implementing it.

    Quote Originally Posted by joey View Post
    Long is present in almost every table of a typical database. Can you suggest ways to accelerate compression/decompression of the long datatype?
    "Integer Compression" is highly depending on the distribution of your data.
    You can try "variable simple" with a large block size. If most integers are small, you can try TurboPFor or TurboPack otherwsise try TurboVbyte.
    If the cardinality of the data is small, try to use dictionary mapping for better compression.
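
    A hypothetical sketch of the dictionary mapping idea (names and layout are illustrative, not a TurboPFor API): build a sorted dictionary of the distinct values, replace each value by its dictionary index, then bit-pack the small indices.

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    static int cmp_u64(const void *a, const void *b)
    {
        uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
        return (x > y) - (x < y);
    }

    /* Map in[0..n-1] to indices into a sorted, deduplicated dictionary.
       dict must have room for n entries; *dn receives the cardinality. */
    static void dict_map(const uint64_t *in, size_t n,
                         uint64_t *dict, size_t *dn, uint32_t *idx)
    {
        size_t m = 0;
        memcpy(dict, in, n * sizeof *dict);
        qsort(dict, n, sizeof *dict, cmp_u64);
        for (size_t i = 0; i < n; i++)            /* deduplicate in place */
            if (!m || dict[i] != dict[m-1]) dict[m++] = dict[i];
        *dn = m;
        for (size_t i = 0; i < n; i++) {          /* binary search each value */
            size_t lo = 0, hi = m;
            while (lo + 1 < hi) {
                size_t mid = (lo + hi) / 2;
                if (dict[mid] <= in[i]) lo = mid; else hi = mid;
            }
            idx[i] = (uint32_t)lo;                /* small index, bit-packs well */
        }
    }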
    Last edited by dnd; 26th November 2015 at 12:01.

  23. #22
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    456
    Thanks
    46
    Thanked 164 Times in 118 Posts

    Advantage of the scalar TurboPack

    Actually, you can use direct access only with the scalar version.
    Direct access means you can extract or update a single value (by position) without decompressing the whole block.
    This is useful, for example, in column-oriented databases (column stores).
    The index search delivers a row number that you can use to directly access the remaining row columns.
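
    For illustration, a toy sketch of what direct access into a b-bit packed buffer amounts to (illustration only, not the actual TurboPFor code or API): compute the bit position of entry i, load the containing bytes, then shift and mask.

    #include <stdint.h>
    #include <string.h>

    /* Extract the i-th b-bit value (b <= 32) from a tightly packed buffer.
       Assumes a little-endian target; reads 8 bytes at the containing
       offset, so the buffer needs a few bytes of padding at the end. */
    static uint32_t bit_extract(const unsigned char *in, size_t i, unsigned b)
    {
        size_t bitpos = i * b;                    /* absolute bit offset   */
        uint64_t w;
        memcpy(&w, in + (bitpos >> 3), sizeof w); /* unaligned 64-bit load */
        return (uint32_t)((w >> (bitpos & 7)) & ((1ull << b) - 1));
    }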

  24. #23
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    456
    Thanks
    46
    Thanked 164 Times in 118 Posts
    New Skylake benchmarks: integer compression at more than 3 billion integers per second, or more than 12 GB/s.

  25. #24
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    456
    Thanks
    46
    Thanked 164 Times in 118 Posts
    TurboPFor (PFor/PForDelta) decoding in "TurboPFor: Integer Compression" improved by 30%!

  26. #25
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    456
    Thanks
    46
    Thanked 164 Times in 118 Posts
    Scalar variable byte "TurboVByte" in "TurboPFor: Integer Compression" improved and now up to 25% faster.
    More efficient and faster than any other variable byte implementation, even faster than "SIMD MaskedVByte".

  27. #26
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    456
    Thanks
    46
    Thanked 164 Times in 118 Posts
    TurboPFor Integer Compression update:
    QMX described in "Compression, SIMD, and Postings Lists" added to the benchmark app.

  28. #27
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    456
    Thanks
    46
    Thanked 164 Times in 118 Posts
    State-of-the-art TurboPFor integer compression compared and presented at the ADCS 2016
    conference: "In Vacuo and In Situ Evaluation of SIMD Codecs"

  29. #28
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    456
    Thanks
    46
    Thanked 164 Times in 118 Posts
    New TurboPFor Integer Compression update incl. AVX2:

    - New TurboPack AVX2 bitpacking: decoding 10 billion integers/s (40 GB/s)!
    - New TurboPFor AVX2: now 50% faster
    - New TurboPFor Hybrid adaptive: better compression and more speed
    - New TurboVByte scheme: more compression and speed

  30. #29
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    456
    Thanks
    46
    Thanked 164 Times in 118 Posts
    - New improved encoding for sorted/unsorted lists
    - TurboPFor zigzag encoding for unsorted integer lists
    - TurboFOR encoding for sorted + unsorted integer lists
    - Improved documentation
    - Improved Java interface
    - New high-level functions as simple as memcpy

  31. #30
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    456
    Thanks
    46
    Thanked 164 Times in 118 Posts
    New: (Differential) Finite Context Method FCM/DFCM for fast floating-point compression

    See: TurboPFor Integer Compression
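
    For reference, a compact sketch of the FCM idea (my illustration of the classic scheme from the floating-point compression literature, not the actual TurboPFor code): predict the next 64-bit pattern from a hash of recent values; the XOR of the actual value with the prediction has many leading zeros when prediction succeeds, which compresses well.

    #include <stdint.h>
    #include <string.h>

    #define FCM_SIZE (1u << 16)                  /* illustrative table size */
    static uint64_t fcm_tab[FCM_SIZE];
    static uint32_t fcm_hash;

    /* Returns value XOR prediction; a decoder mirrors this model and
       recovers v = code ^ pred. The hash is an illustrative choice. */
    static uint64_t fcm_encode(double d)
    {
        uint64_t v; memcpy(&v, &d, sizeof v);
        uint64_t pred = fcm_tab[fcm_hash];       /* predicted bit pattern  */
        fcm_tab[fcm_hash] = v;                   /* update the model       */
        fcm_hash = (uint32_t)((fcm_hash << 6) ^ (v >> 48)) & (FCM_SIZE - 1);
        return v ^ pred;                         /* small when predictable */
    }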

  32. The Following User Says Thank You to dnd For This Useful Post:

    SolidComp (1st March 2017)


