
Thread: Comprehensive wiki for data compression

  1. #1
    Member SolidComp's Avatar
    Join Date
    Jun 2015
    Location
    USA
    Posts
    222
    Thanks
    89
    Thanked 46 Times in 30 Posts

    Comprehensive wiki for data compression

    Hi all – It seems like it would be a good idea to have a comprehensive wiki for data compression: a site with in-depth articles on every subtopic, encoding method, codec implementation, and benchmark, plus histories of how particular approaches evolved, etc.

    Would any of you be willing to help deploy the wiki? I'd need some help for it to be deployed by, say, July. If I have to do it by myself, it will take longer.

    I already bought two applicable domains – compression.wiki and compress.wiki. They'll probably cost about $25 per year to renew, each. I think the total cost of hosting the wiki site, including domain registration, would be $100-200 per year. I assume a donation link on the site will be sufficient to take care of it, maybe with a live "balance needed" display or something.

    Anyway, who's willing to help?

    Cheers,

    Joe Duarte
    https://www.researchgate.net/profile/Joe_Duarte

  2. The Following 3 Users Say Thank You to SolidComp For This Useful Post:

    boxerab (6th June 2017), Gonzalo (21st April 2017), Jarek (6th June 2017)

  3. #2
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    I guess you can take some stuff from https://web.archive.org/web/20160620...m/authors.html

    Also, you can save on hosting by letting me or the webmaster host it.

    But the main question is what exactly your point is in this. It certainly won't be profitable - you won't even recover the hosting/domain costs from donations/ads.
    Also, Wikipedia already has plenty of articles about data compression - https://en.wikipedia.org/wiki/Catego...ion_algorithms .
    The only (imho) important thing that Wikipedia lacks is the "who did what" list like Sami's site had, because most compression developers
    are not popular enough for Wikipedia. Otherwise there's not that much information really; it's mostly covered by http://mattmahoney.net/dc/text.html etc.

    There are also problems with copyrights (for example, there's a DCC paper archive and decompiled sources of various codecs, but we can't make them easily available),
    it's hard to say whether to include audio/video compression (because general data compression would easily drown in it, as happened with compression.ru),
    and it's unclear how to describe the algorithms (e.g. Python is useless for data compression).

  4. #3
    Member
    Join Date
    Aug 2014
    Location
    Argentina
    Posts
    464
    Thanks
    202
    Thanked 81 Times in 61 Posts
    Wikipedia has a few drawbacks for scientific and technical info. One of them is its policy about "notability".
    See this for an example:
    https://en.wikipedia.org/wiki/Wikipe...letion/NanoZip

    Many of the interesting things in data compression happen on the alpha side of development. On Wikipedia, you basically have to wait until something is yesterday's news before you can publish it.

  5. The Following User Says Thank You to Gonzalo For This Useful Post:

    SolidComp (26th April 2017)

  6. #4
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    I found that wiki page: http://nishi.dreamhosters.com/u/nanozip.txt
    I can't say it would have been better to keep it.

  7. #5
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    you can compare that to the first version of the FreeArc page: https://en.wikipedia.org/w/index.php...ldid=308460316

  8. #6
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    Well, unlike the nanozip page, this doesn't look like plain advertisement.
    ...And then Sami appeared in the deletion discussion and killed it by using his own benchmark site as proof of nanozip's performance.

  9. #7
    Member
    Join Date
    Aug 2014
    Location
    Argentina
    Posts
    464
    Thanks
    202
    Thanked 81 Times in 61 Posts
    I agree with you, but you are missing the whole point. It doesn't matter whether it was written as an advertisement, because that wasn't the reason it got kicked off. It was the lack of notability. And that's OK, because what's "notable" for the whole world differs a lot from what's notable for a person inside the data compression world.
    Just try to add an entry about any ground-breaking, not-thoroughly-tested program still in beta, like some of those presented here in the forum. Your reflate, for example, or GLZA. I would love to have an article explaining data recompression in depth, but that's not going to happen any time soon on Wikipedia...

  10. #8
    Member
    Join Date
    Jul 2013
    Location
    United States
    Posts
    194
    Thanks
    44
    Thanked 140 Times in 69 Posts
    Why not just host it on GitHub or GitLab? They both have free wikis, and most people around here probably have at least a GitHub account already. Best of all, while both sites let you edit it through a web interface, you can also use git to download the whole wiki and edit it locally (it's just a bunch of markdown documents). Not only is that a nice feature for working with the repo, but it helps make backups really easy in case GitHub/GitLab ever goes away.

    The only real disadvantage I see is that (AFAIK) neither supports embedding equations directly into the page, though you can use iTex2Img (or something similar) and embed an image in the wiki.
    Last edited by nemequ; 24th April 2017 at 11:49.

  11. The Following User Says Thank You to nemequ For This Useful Post:

    schnaader (24th April 2017)

  12. #9
    Member
    Join Date
    Nov 2015
    Location
    Śląsk, PL
    Posts
    81
    Thanks
    9
    Thanked 13 Times in 11 Posts
    Quote Originally Posted by nemequ View Post
    Why not just host it on GitHub or GitLab? They both have free wikis, and most people around here probably have at least a GitHub account already. Best of all, while both sites let you edit it through a web interface, you can also use git to download the whole wiki and edit it locally (it's just a bunch of markdown documents). Not only is that a nice feature for working with the repo, but it helps make backups really easy in case GitHub/GitLab ever goes away.

    The only real disadvantage I see is that (AFAIK) neither supports embedding equations directly into the page, though you can use iTex2Img (or something similar) and embed an image in the wiki.
    I think something like wikia would be a better choice as it doesn't require registration.

  13. #10
    Member
    Join Date
    Jul 2013
    Location
    United States
    Posts
    194
    Thanks
    44
    Thanked 140 Times in 69 Posts
    Technically neither would GitHub/GitLab. You don't have to use the web interface; you could easily accept contributions with git send-email, diffs posted as attachments here (or anywhere else), etc.

    Accepting unregistered edits may not be a good idea anyway… it tends to make abuse easier.

    Personally, I'd feel a lot more comfortable with a DVCS where lots of people have copies instead of trusting everything to one site. Wikia would be safer than setting up something custom, but I'm not excited about the idea of a bunch of ads plastered all over the place… some company offering spyware with a veneer of compression software could easily buy some ad space and make it seem like an endorsement.

    That said, I probably wouldn't contribute much to a wiki like this (I already spend more than enough time writing documentation), so I'm not sure why anyone should care about my opinion here.

  14. The Following User Says Thank You to nemequ For This Useful Post:

    schnaader (25th April 2017)

  15. #11
    Member
    Join Date
    Dec 2014
    Location
    Berlin
    Posts
    29
    Thanks
    35
    Thanked 26 Times in 12 Posts
    Equations can easily be supported, LaTeX-style, if you use the GitHub repo as the Markdown source for a pandoc-generated github.io site (
    pandoc -f markdown+tex_math_dollars+simple_tables --mathjax -t html5
    ).

  16. #12
    Member
    Join Date
    Jul 2013
    Location
    United States
    Posts
    194
    Thanks
    44
    Thanked 140 Times in 69 Posts
    I'm not sure exactly where the line between a wiki and an open-source, collaboratively developed web site is, but that seems pretty firmly in the latter camp. Jekyll + GitHub Pages may be a bit closer, since GitHub will automatically regenerate the HTML, but I think that's still unacceptably complex… You would also lose the web-based wiki editor if you used something like pandoc or Jekyll.

  17. #13
    Member SolidComp's Avatar
    Join Date
    Jun 2015
    Location
    USA
    Posts
    222
    Thanks
    89
    Thanked 46 Times in 30 Posts
    Quote Originally Posted by Shelwien View Post
    I guess you can take some stuff from https://web.archive.org/web/20160620...m/authors.html

    Also, you can save on hosting by letting me or the webmaster host it.

    But the main question is what exactly your point is in this. It certainly won't be profitable - you won't even recover the hosting/domain costs from donations/ads.
    Also, Wikipedia already has plenty of articles about data compression - https://en.wikipedia.org/wiki/Catego...ion_algorithms .
    The only (imho) important thing that Wikipedia lacks is the "who did what" list like Sami's site had, because most compression developers
    are not popular enough for Wikipedia. Otherwise there's not that much information really; it's mostly covered by http://mattmahoney.net/dc/text.html etc.

    There are also problems with copyrights (for example, there's a DCC paper archive and decompiled sources of various codecs, but we can't make them easily available),
    it's hard to say whether to include audio/video compression (because general data compression would easily drown in it, as happened with compression.ru),
    and it's unclear how to describe the algorithms (e.g. Python is useless for data compression).
    Yo Shelwien, it never occurred to me that it could be profitable, and that definitely isn't a goal. I'm pretty sure we could recover the hosting costs – what makes you think that would be hard?

    Wikipedia has major limitations for this, as Gonzalo said. We'll just have to deal with whatever copyright issues come up – that won't be a problem for most content.

    When you say that you could host it, what site are you referring to? Or do you have a hosting service?

  18. #14
    Member SolidComp's Avatar
    Join Date
    Jun 2015
    Location
    USA
    Posts
    222
    Thanks
    89
    Thanked 46 Times in 30 Posts
    Quote Originally Posted by nemequ View Post
    Why not just host it on GitHub or GitLab? They both have free wikis, and most people around here probably have at least a GitHub account already. Best of all, while both sites let you edit it through a web interface, you can also use git to download the whole wiki and edit it locally (it's just a bunch of markdown documents). Not only is that a nice feature for working with the repo, but it helps make backups really easy in case GitHub/GitLab ever goes away.

    The only real disadvantage I see is that (AFAIK) neither supports embedding equations directly into the page, though you can use iTex2Img (or something similar) and embed an image in the wiki.
    I think GitHub wikis are pretty bad. The design is confusing, and the tables of contents are strange. They also have a major bug: their tables of contents generally don't work on mobile. If you tap a heading in a TOC on Chrome for Android, for instance, nothing happens. The anchor links don't work on mobile, and this has been a known bug for years. That's a pretty severe bug that should have been fixed in a week.

    Equations are extremely important.

    (I'm not 100 percent certain that the TOC links on their wiki pages have the same problems as on normal repo README pages, but the fact that they have such massive bugs annoys me.)

    I know less about GitLab. I should take a look at their wiki.

  19. #15
    Member SolidComp's Avatar
    Join Date
    Jun 2015
    Location
    USA
    Posts
    222
    Thanks
    89
    Thanked 46 Times in 30 Posts
    Quote Originally Posted by m^3 View Post
    I think something like wikia would be a better choice as it doesn't require registration.
    The problem I have with Wikia is that it's a brutal website – it's just so heavy and slow, probably because of too many HTTP requests and ads galore. On mobile, it's a terrible experience trying to look up a Game of Thrones character or something. It's just so slow to load, and it strains the browser.

    So I should probably make this clear: the compression wiki needs to be a damn good website, clean and fast on all platforms. Most websites suck, and I want it to not be like most websites.

  20. #16
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    > Yo Shelwien, it never occurred to me that it could be profitable, and that definitely isn't a goal.

    Once upon a time, around 1995-2005, it could be.

    > I'm pretty sure we could recover the hosting costs – what makes you think that would be hard?

    Experience with this forum, for example (and other related sites).
    Of course, you might have some friends who would donate $100 at once,
    but you certainly won't get 20 x $5 from random people.
    With something like nanozip, maybe, but not with a wiki site.

    > When you say that you could host it, what site are you referring to? Or do you have a hosting service?

    I'm already paying for a hosting service anyway (dreamhost), so I don't have a problem
    with adding another low-load site there. You'd just have to configure your domain(s)
    to use their NS, and I'd make an account for you.

  21. #17
    Member
    Join Date
    Nov 2015
    Location
    Śląsk, PL
    Posts
    81
    Thanks
    9
    Thanked 13 Times in 11 Posts
    Quote Originally Posted by nemequ View Post
    Technically neither would GitHub/GitLab. You don't have to use the web interface; you could easily accept contributions with git send-email, diffs posted as attachments here (or anywhere else), etc.
    This doesn't sound like a serious proposal.

    Quote Originally Posted by nemequ View Post
    Accepting unregistered edits may not be a good idea anyway… it tends to make abuse easier.
    Not easier, just saves a couple of minutes.

    Quote Originally Posted by nemequ View Post
    Personally, I'd feel a lot more comfortable with a DVCS where lots of people have copies instead of trusting everything to one site. Wikia would be safer than setting up something custom, but I'm not excited about the idea of a bunch of ads plastered all over the place… some company offering spyware with a veneer of compression software could easily buy some ad space and make it seem like an endorsement.
    Quote Originally Posted by SolidComp View Post
    The problem I have with Wikia is that it's a brutal website – it's just so heavy and slow, probably because of too many HTTP requests and ads galore. On mobile, it's a terrible experience trying to look up a Game of Thrones character or something. It's just so slow to load, and it strains the browser.
    That's indeed a major concern. Whatever ads they have, they don't get past my guards, so I wasn't aware of it having any advertisements, let alone being plastered with them. And it's super fast here, too.

  22. #18
    Member
    Join Date
    Jul 2013
    Location
    United States
    Posts
    194
    Thanks
    44
    Thanked 140 Times in 69 Posts
    I'm not sure if anyone is still interested in this, but I just found out that GitLab supports equations (using KaTeX).

  23. The Following 2 Users Say Thank You to nemequ For This Useful Post:

    pothos2 (23rd May 2017), schnaader (29th May 2017)

  24. #19
    Member SolidComp's Avatar
    Join Date
    Jun 2015
    Location
    USA
    Posts
    222
    Thanks
    89
    Thanked 46 Times in 30 Posts
    Quote Originally Posted by nemequ View Post
    I'm not sure if anyone is still interested in this, but I just found out that GitLab supports equations (using KaTeX).
    Interesting. I've been playing with KaTeX and MathJax. I probably like KaTeX more because it's easier to generate the HTML for the equation ahead of time and not force users to download a big JS library which then renders the equation on the fly, a typically terrible pattern of web development. I'm still trying to understand how they make equations look so good with just HTML and CSS, and no MathML (which only Firefox and Safari support).

    Does GitLab support custom domains?

  25. #20
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    645
    Thanks
    205
    Thanked 196 Times in 119 Posts
    While I completely understand the need for a comprehensive wiki for data compression (e.g. benchmarks with generally agreed objectivity, lesser-known compressors, some of their details), a more basic and complementary activity is improving the Wikipedia that already exists and is widely used, where the data compression coverage is very far from perfect - maybe we should also discuss this issue here ... and try to bring some improvements and updates there.

    Here are some of my remarks (any others?):

    - the template is a complete mess ( https://en.wikipedia.org/wiki/Templa...ession_methods ), mixing everything: methods with actual compressors, predictors with transformations - I wrote a comment a few months ago ( https://en.wikipedia.org/wiki/Templa...ession_methods ) but there was no response (generally, data compression seems to be ignored by editors),

    - missing articles and updates, like a general article about repacking/recompression discussing the growing set of tools containing e.g. packJPG, mozjpeg, lepton, guetzli,

    - guarding existing articles from vandalism - maybe specialists in some topics should keep them on their watchlists (and check them occasionally). For example, I currently have a "discussion" regarding https://en.wikipedia.org/wiki/Huffman_coding - while at the top there was a clear warning, with an example, about its suboptimality and the popular ways used to repair it (AC and ANS), a person with many Huffman articles is trying to hide this information by replacing that paragraph with the extremely innocent "Huffman coding is not always optimal among all compression methods",

    - maybe deletion of suspicious articles (e.g. https://en.wikipedia.org/wiki/Statis...el%E2%80%93Ziv ?), or merging some, e.g. "arithmetic-range coding", "PPM-DMC-CTW", "dozens of LZ variants" (which often don't even make a clear distinction between them - maybe there should be additional collective articles clearly specifying these differences?).

  26. #21
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    437
    Thanks
    137
    Thanked 152 Times in 100 Posts
    Agreed that improving Wikipedia would reach the widest audience. Some details may be too technical, but we also have Matt's excellent pages.

    That Huffman article is particularly confusing. "Although Huffman's original algorithm is optimal for a symbol-by-symbol coding..." is followed by (correctly) "Huffman coding is optimal when each input symbol is a known independent and identically distributed random variable having a probability that is the inverse of a power of two". The first statement clearly means the choice of bit-codes is optimal, not that bit-codes themselves are optimal.

  27. #22
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    Actually, a Huffman code is not an optimal bitcode.
    It's an optimal _prefix_ bitcode.
    I have a working implementation of a non-prefix bitcode which provides better compression than Huffman
    (using escape codes and backtracking).

    Btw, I installed mediawiki: http://ctxmodel.net/w/index.php?title=Special:Log

  28. The Following User Says Thank You to Shelwien For This Useful Post:

    Jarek (7th June 2017)

  29. #23
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    645
    Thanks
    205
    Thanked 196 Times in 119 Posts
    Shelwien, could you give a simple example of a bitcode that is better than Huffman?

  30. #24
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    As I said, it's a code without the prefix property.
    I can generate an example if you really want, but a comprehensible step-by-step description would take some work.
    For now, I found this: https://encode.ru/threads/1116-Charl...ll=1#post22277

    My implementation started with a Huffman code and then iteratively replaced Huffman codes with shorter codes that weren't observed
    in the block (so the additional compression comes from higher-order statistics).
    Thus encoding is very slow, but decoding is just a little slower than Huffman.

    The decoder has to work kind of like a recursive parser for a programming language.
    At some point it will encounter multiple matches, so it has to try every choice
    until it can parse to the end of the block without meeting any escape codes.
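
    To make that concrete, here is a minimal sketch of a backtracking decoder for a non-prefix code table (just an illustration of the idea, not Shelwien's actual implementation; the escape codes that reject wrong parses in the real scheme are omitted, and the code table below is hypothetical):
    Code:
    #include <cstdio>
    #include <string>
    #include <vector>

    struct Code { char symbol; std::string bits; };

    // Try to parse 'bits' from position 'pos' to the end using a (possibly
    // non-prefix) code table; on success 'out' holds the decoded symbols.
    static bool parse(const std::string& bits, size_t pos,
                      const std::vector<Code>& codes, std::string& out) {
        if (pos == bits.size()) return true;
        for (const Code& c : codes) {
            if (bits.compare(pos, c.bits.size(), c.bits) == 0) {
                out.push_back(c.symbol);
                if (parse(bits, pos + c.bits.size(), codes, out)) return true;
                out.pop_back();  // dead end: backtrack and try the next candidate
            }
        }
        return false;
    }

    int main() {
        // Hypothetical table: 'c' reuses the short pattern "00" instead of Huffman's "110".
        // Note: with such a toy table some bit strings have more than one parse;
        // in the real scheme the escape codes / data constraints resolve that.
        std::vector<Code> codes = { {'c', "00"}, {'d', "111"}, {'b', "10"}, {'a', "0"} };
        std::string bits = "10" "00" "0" "111";  // an encoding of "bcad"
        std::string decoded;
        if (parse(bits, 0, codes, decoded))
            std::printf("%s\n", decoded.c_str());  // prints "bcad"
        return 0;
    }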

    P.S.: Added an example, but I doubt that you'd be able to see anything in it.
    Unfortunately it's a very old experiment (~1999), so I only have DOS implementations of it.
    Attached Files

  31. #25
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    645
    Thanks
    205
    Thanked 196 Times in 119 Posts
    As Huffman is generally believed to be the optimal bitcode (not only among prefix codes), it would be great to have a simple counterexample that can be easily traced - using just a few symbols (3?) ...
    A follow-up, but much more difficult, question is understanding how bad it is - e.g. a theoretical bound on the improvement available with non-prefix codes, and maybe examples reaching or approaching it ...

  32. #26
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    1. One simple example with 4 symbols ('a','b','c','d') is where Huffman assigns the codes a:0, b:10, c:110, d:111, but 'aa' never occurs in the data.
    Then the code 00 can be assigned to 'c', and a greedy parser would always correctly decode 00 as 'c'.
    No backtracking is needed in this example, because all the other codes start with 1.

    2. It's similar to how we can get better compression by grouping symbols before assigning Huffman codes.
    In any case, it won't be able to beat the order-* entropy of the data, and beyond that the practical results depend more on the specific implementation.

  33. The Following User Says Thank You to Shelwien For This Useful Post:

    Jarek (8th June 2017)

  34. #27
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    1. huffman code is optimal for an order-0 data source with a given probability distribution

    2. it's not a bit coder. instead, like ari/ans, it just finally outputs some sequence of bits

  35. #28
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    We just discussed that with Shelwien. My conclusion: his program includes a table with an optimal code for each particular file, so the overall code for the data source isn't a bit code any more, from a theoretical POV. But it doesn't make any difference from a practical POV - a non-prefix code may work as fast as Huffman while providing a better compression ratio.

    Moreover, if we have encoded the symbol probabilities in the data, they don't actually represent an order-0 model! Once we have started decoding, the probabilities of the remaining symbols change as we decode the string prefix, and we can modify the code according to the changed probabilities.
    Last edited by Bulat Ziganshin; 7th June 2017 at 17:30.
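
    To illustrate that last point with concrete numbers (my own illustration, not Bulat's): if the decoder knows the exact symbol counts of a block - n_a occurrences of 'a', n_b of 'b', ..., with n symbols in total - the distribution is conditional rather than i.i.d.:

    \[ P(\text{first}=a)=\frac{n_a}{n}, \qquad P(\text{second}=a \mid \text{first}=a)=\frac{n_a-1}{n-1}, \qquad P(\text{second}=b \mid \text{first}=a)=\frac{n_b}{n-1}, \;\ldots \]

    so a code tuned to the block's counts can, in principle, be updated as decoding proceeds (essentially the idea behind enumerative coding).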

  36. #29
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    645
    Thanks
    205
    Thanked 196 Times in 119 Posts
    OK, so Huffman is still the optimal (not only prefix) "bitcode" (symbol -> bit sequence) ... for i.i.d. sources (which indeed should be stated explicitly).

    Anyway, the https://en.wikipedia.org/wiki/Huffman_coding article is kind of a "gateway to data compression" - Huffman coding is probably the most widely taught data compression algorithm, and many of those students will check its Wikipedia article - it is crucial that it presents the situation properly and maybe gets the reader interested in the data compression field.
    First of all, it should clearly (at the top of the article) emphasize the suboptimality of Huffman/prefix codes (and mention the most popular ways to repair it), like the ~1% ratio loss for letter-wise English text, or like the deleted paragraph: "As an example, a symbol of probability 0.99 carries only log2(1/0.99) ≈ 0.014 bits of information, but Huffman coding encodes each symbol separately and therefore the minimum length for each symbol is 1 bit" ... but this Huffman crusader (dozens of Huffman/prefix-code articles) has removed this information about its weaknesses and trivialized it with "Huffman coding is not always optimal among all compression methods."
    A person visiting this article might also be searching for a good (fast) implementation - there was a link to Yann's in External Links ... but this person has just deleted it.

    And no editor has intervened - this is an example showing that if the data compression community wants to be properly represented in widely visited Wikipedia articles, we should improve and guard them ourselves.
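
    For reference, the arithmetic behind that deleted example (a straightforward entropy calculation, not taken from the article): a binary source with probabilities 0.99 and 0.01 has entropy

    \[ H = 0.99\log_2\tfrac{1}{0.99} + 0.01\log_2\tfrac{1}{0.01} \approx 0.0144 + 0.0664 \approx 0.081\ \text{bits/symbol}, \]

    while Huffman must assign every symbol a whole codeword of at least 1 bit, i.e. roughly 12x the entropy; arithmetic coding and ANS avoid this by not spending an integer number of bits per symbol.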

  37. #30
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    437
    Thanks
    137
    Thanked 152 Times in 100 Posts
    My first compressor was a Huffman one, on a trusty BBC Micro (in 6502 assembly).

    It's a tricky balance between being so wordy that people don't bother to read it and being insufficiently detailed. Huffman can be "sufficient" for most cases, and it's simple, so there is nothing wrong with people using it - even reimplementing what everyone else has done already (likely badly, like my first effort). The key is to explain when it doesn't work. For example, pure Huffman can only ever get to an 8:1 compression ratio, max. (Unless you have a single-symbol case, in which case the prefix code is zero bits.) To beat that you need something else (deduplication, run-length encoding, or just not Huffman!). Small alphabet sizes and extreme distributions are its primary weaknesses.
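
    The 8:1 figure follows directly from the integer-codeword-length constraint, assuming 8-bit input symbols:

    \[ \text{output} \ge n \text{ bits for } n \text{ input bytes} \quad\Rightarrow\quad \text{ratio} = \frac{8n}{\text{output bits}} \le \frac{8n}{n} = 8, \]

    since each of the n input bytes must receive a codeword of at least 1 bit. Run-length encoding, deduplication, or a non-Huffman coder (arithmetic/ANS) can go beyond this because they are not forced to spend a whole bit per input symbol.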


