
Thread: lost interest in text data compression

  1. #1
    Member RichSelian's Avatar
    Join Date
    Aug 2011
    Location
    Shenzhen, China
    Posts
    156
    Thanks
    18
    Thanked 50 Times in 26 Posts

    lost interest in text data compression

    Recently I completed my undergraduate graduation project: compressing a read-only database while still supporting fast queries. I selected this topic because of my interest in compression, but I soon found that LZ77, BWT and other complex models are useless for it. Those methods can easily compress the redundant database from 1TB to less than 1GB, but none of them supports fast queries. In the end I used sorting and common-prefix coding to get it down to 10GB compressed, with queries running at about 100,000 records/s.
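
    (For concreteness, here is a minimal Python sketch of that kind of block-wise common-prefix coding over sorted keys, with binary search on the block headers. It is only an illustration of the general idea, not the project's actual code; the block size and key format are made up.)

    Code:
    # Sorted keys are grouped into blocks; each block stores its first key in
    # full and the rest as (shared-prefix length, suffix). Lookup binary-searches
    # the block headers, then decodes a single block linearly.
    from bisect import bisect_right

    BLOCK = 16  # keys per block; a tuning knob, not a value from the thread

    def build(sorted_keys):
        heads, blocks = [], []
        for i in range(0, len(sorted_keys), BLOCK):
            chunk = sorted_keys[i:i + BLOCK]
            heads.append(chunk[0])
            encoded, prev = [], chunk[0]
            for k in chunk[1:]:
                # length of the prefix shared with the previous key
                p = 0
                while p < min(len(prev), len(k)) and prev[p] == k[p]:
                    p += 1
                encoded.append((p, k[p:]))
                prev = k
            blocks.append(encoded)
        return heads, blocks

    def contains(heads, blocks, key):
        i = bisect_right(heads, key) - 1
        if i < 0:
            return False
        prev = heads[i]
        if prev == key:
            return True
        for p, suffix in blocks[i]:
            prev = prev[:p] + suffix
            if prev == key:
                return True
            if prev > key:          # keys are sorted, so we can stop early
                return False
        return False

    heads, blocks = build(sorted(["alpha", "alphabet", "beta", "betamax", "gamma"]))
    print(contains(heads, blocks, "betamax"))  # True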

    I've studied lossless compression theory out of interest for more than a year and written some noob-like compressors, but my graduation project made me think harder about it. Nowadays disks and networks are much cheaper than CPU resources; for data with a lot of redundancy, deflate or LZMA is enough most of the time. Why compress it 100KB smaller but waste a lot of time on both compression and decompression? (I don't mean to blame Matt's PAQ; it's great work.)

    Now my interest has moved to database implementation, starting from Google's LevelDB and Fu Siyuan's CuttDB. Maybe I will no longer be playing with compression..

  2. #2
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    Well, I think you are right that the trend is toward fast compression of huge data sets with fast random access and update. High compression is mostly of theoretical interest as a way of measuring progress in AI, not for saving storage or bandwidth.

  3. #3
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Are you aware of compressed self indexes? http://pizzachili.dcc.uchile.cl/index.html

    These may or may not have helped with your compressed database project. They provide a mix of compression and full-text indexing. The compression and search performance are both excellent. However, they are not very good at random access or at determining where things are positioned. In other words, you can very efficiently determine whether a substring is in the database and count the occurrences, but getting the positions of the matches is difficult, and reading at a specific location is difficult. It's not impossible, of course, but to get efficient positions you basically have to add on an additional index to translate from absolute positions to compressed positions, and there's no really great way to compress this index. So there's a direct trade-off between efficiency and space.
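
    (As a toy illustration of how such self-indexes count matches without keeping the text in plain form, here is a sketch of FM-index-style backward search in Python. It is my own simplification, not the Pizza&Chili code; real implementations replace the naive rank scan with compressed rank/select structures and add sampled positions to report match locations, which is exactly the extra index mentioned above.)

    Code:
    # Build the BWT of text + '$' by sorting the rotations, then count pattern
    # occurrences with backward search over the BWT.
    def bwt(text):
        text += "$"                      # unique terminator, smaller than all symbols
        rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
        return "".join(r[-1] for r in rotations)

    def count_occurrences(pattern, text):
        L = bwt(text)
        # C[c] = number of characters in the text strictly smaller than c
        C = {c: sum(1 for x in L if x < c) for c in set(L)}
        rank = lambda c, i: L[:i].count(c)     # occurrences of c in L[0:i]
        lo, hi = 0, len(L)                     # current suffix-array interval
        for c in reversed(pattern):
            if c not in C:
                return 0
            lo = C[c] + rank(c, lo)
            hi = C[c] + rank(c, hi)
            if lo >= hi:
                return 0
        return hi - lo

    print(count_occurrences("an", "banana"))   # 2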

  4. #4
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by Matt Mahoney View Post
    High compression is mostly of theoretical interest as a way of measuring progress in AI, not for saving storage or bandwidth.
    For some, it's also an all-consuming hobby and an obsession. Once you go down the rabbit hole, it's hard to find your way back out.

  5. #5
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts
    I was hoping the bug would bite you, since you used BWTS, which fewer than 4 or 5 people even give the light of day. I know that someday, assuming no major war, the new compression algorithms will all be created by AI, and I suspect the Chinese or the Indians are the only civilizations left with a chance to do that before the liberals or barbarians destroy what is left of civilization.
    If you give up on compression, why not get involved in Artificial Intelligence, to the level of creating a smart, self-aware computer that will replace most of man's work? Make them at least as smart as those in the TV series Caprica.

    Take care and
    good luck, my fellow traveler.

  6. The Following 3 Users Say Thank You to biject.bwts For This Useful Post:

    Bloax (23rd April 2013),kvark (19th February 2014),nburns (23rd April 2013)

  7. #6
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    I have the book "Managing Gigabytes" by Witten, Moffat, and Bell. They describe how to make compressed indexes of large databases. The idea is to build a table that maps search terms to their locations. Once you have this index, you can discard the data it points to because you have enough information to reconstruct it. The index can be compressed by delta coding. It is basically a word-level order 0 model. It is not the best you can do, but good enough.
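
    (A toy sketch of that idea, assuming nothing beyond what is described above: map each word to the sorted list of its positions and store the gaps between positions, which are small and would compress well with a variable-length code. This is an illustration, not the book's actual scheme.)

    Code:
    # Inverted index with delta-coded posting lists: word -> first position,
    # then gaps to the following positions.
    from collections import defaultdict

    def build_index(words):
        postings = defaultdict(list)
        for pos, w in enumerate(words):
            postings[w].append(pos)
        # store gaps (deltas) instead of absolute positions
        return {w: [p[0]] + [b - a for a, b in zip(p, p[1:])]
                for w, p in postings.items()}

    def positions(index, word):
        out, total = [], 0
        for gap in index.get(word, []):
            total += gap
            out.append(total)
        return out

    words = "to be or not to be".split()
    idx = build_index(words)
    print(idx["be"])             # [1, 4]  -> first position, then gap
    print(positions(idx, "be"))  # [1, 5]  -> decoded absolute positions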

  8. #7
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    Quote Originally Posted by nburns View Post
    For some, it's also an all-consuming hobby and an obsession. Once you go down the rabbit hole, it's hard to find your way back out.
    That's actually how I got started.

  9. #8
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by Matt Mahoney View Post
    I have the book "Managing Gigabytes" by Witten, Moffat, and Bell. They describe how to make compressed indexes of large databases. The idea is to build a table that maps search terms to their locations. Once you have this index, you can discard the data it points to because you have enough information to reconstruct it. The index can be compressed by delta coding. It is basically a word-level order 0 model. It is not the best you can do, but good enough.
    I've implemented the FM-index, which allows searching for any substring, not just predefined search terms. It starts with the BWT, and then (depending on the variant) you apply a wavelet tree. You have to store the wavelet tree in a structure that supports rank/select. The compression depends on the rank/select structure, but it is pretty good.
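
    (Here is a small, uncompressed wavelet-tree sketch in Python to make the rank part concrete. It is only my own illustration: each node halves the alphabet and stores a plain bitvector, and rank(c, i) walks one root-to-leaf path. A real FM-index would back the bitvectors with compressed rank/select structures.)

    Code:
    class WaveletTree:
        def __init__(self, seq, alphabet=None):
            self.alphabet = sorted(set(seq)) if alphabet is None else alphabet
            if len(self.alphabet) <= 1:
                self.bits = None            # leaf: all symbols are identical
                self.n = len(seq)
                return
            mid = len(self.alphabet) // 2
            self.left_set = set(self.alphabet[:mid])
            self.bits = [0 if c in self.left_set else 1 for c in seq]
            self.left = WaveletTree([c for c in seq if c in self.left_set],
                                    self.alphabet[:mid])
            self.right = WaveletTree([c for c in seq if c not in self.left_set],
                                     self.alphabet[mid:])

        def rank(self, c, i):
            """Number of occurrences of c in seq[0:i]."""
            if self.bits is None:
                return min(i, self.n) if self.alphabet and self.alphabet[0] == c else 0
            if c in self.left_set:
                return self.left.rank(c, self.bits[:i].count(0))
            return self.right.rank(c, self.bits[:i].count(1))

    wt = WaveletTree("abracadabra")
    print(wt.rank("a", 8))  # 4 occurrences of 'a' in "abracada"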

  10. #9
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by Matt Mahoney View Post
    That's actually how I got started.
    I wonder if there is something unique about compression that pulls people in. I guess big, fat data files are sort of like whales. Trying to compress a file like enwik8 can make you feel like Ahab chasing the White Whale.

    The White Whale swam before him as the monomaniac incarnation of all those malicious agencies which some deep men feel eating in them, till they are left living on with half a heart and half a lung. That intangible malignity which has been from the beginning; to whose dominion even the modern Christians ascribe one-half of the worlds; which the ancient Ophites of the east reverenced in their statue devil;--Ahab did not fall down and worship it like them; but deliriously transferring its idea to the abhorred white whale, he pitted himself, all mutilated, against it. All that most maddens and torments; all that stirs up the lees of things; all truth with malice in it; all that cracks the sinews and cakes the brain; all the subtle demonisms of life and thought; all evil, to crazy Ahab, were visibly personified, and made practically assailable in Moby Dick. He piled upon the whale's white hump the sum of all the general rage and hate felt by his whole race from Adam down; and then, as if his chest had been a mortar, he burst his hot heart's shell upon it.

    --Moby-Dick, Chapter 41, "Moby Dick"
    Last edited by nburns; 23rd April 2013 at 20:18.

  11. #10
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    People like to see their names at the top of the benchmarks. Maybe I need to create a new one.

  12. #11
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    Hello there RichSelian!

    It seems you're still involved in compression algorithms, and it seems your job is aligned with your interests. Congratulations.

  13. #12
    Member
    Join Date
    Dec 2013
    Location
    Italy
    Posts
    342
    Thanks
    12
    Thanked 34 Times in 28 Posts
    I agree that large texts in the real world mean databases. Look, for example, at the compression in TokuDB.
    It is rather common to categorize DB problems as hard or easy depending on whether the indexes can be cached in RAM or not.
    AFAIK the maximum RAM supported in a PC-like architecture is a Fujitsu server with 384GB of RAM.
    Of course, data and index compression require very special applications.

  14. #13
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,471
    Thanks
    26
    Thanked 120 Times in 94 Posts
    Apache Cassandra uses LZ4, Snappy and Deflate with success and is distributed, highly scalable and highly available.

    If your data is small enough that you can store it entirely in RAM without any general compression algorithm, then there's no point in using one. But once you're accessing the disk a lot, compression can speed things up. For example, compressing a database with e.g. Zhuff means you increase the effective size of your disk cache several times over, leading to far fewer disk accesses if your data is accessed with non-uniform probabilities (i.e. there are hot spots which are frequently accessed while most of the database is cold). Whether it gives an overall speedup depends on the application.
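
    (A toy sketch of that effect, not Cassandra's or Zhuff's actual code: an LRU cache that stores zlib-compressed pages within a fixed RAM budget, so more logical pages fit in the same memory and cold reads hit the disk less often. The page-reading callback and budget are made up for the example.)

    Code:
    import zlib
    from collections import OrderedDict

    class CompressedPageCache:
        def __init__(self, budget_bytes, read_page_from_disk):
            self.budget = budget_bytes
            self.used = 0
            self.read_page = read_page_from_disk   # fallback for cache misses
            self.pages = OrderedDict()             # page_id -> compressed bytes (LRU)

        def get(self, page_id):
            if page_id in self.pages:
                self.pages.move_to_end(page_id)    # mark as recently used
                return zlib.decompress(self.pages[page_id])
            raw = self.read_page(page_id)          # disk access on a miss
            comp = zlib.compress(raw, 6)
            self.pages[page_id] = comp
            self.used += len(comp)
            while self.used > self.budget and self.pages:
                _, evicted = self.pages.popitem(last=False)   # evict LRU page
                self.used -= len(evicted)
            return raw

    # Highly redundant pages compress well, so many fit in a small budget.
    cache = CompressedPageCache(4096, lambda pid: (b"row%08d" % pid) * 512)
    cache.get(1); cache.get(2)
    print(len(cache.pages), cache.used)   # 2 pages cached, well under 4096 bytes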

