
Thread: "Extreme" compression of DNS domains

  1. #1
    Member
    Join Date
    Oct 2011
    Location
    London, UK
    Posts
    10
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Question "Extreme" compression of DNS domains

    I wish to very compactly encode the domain of each website a random Internet user visits, but find that the compression achieved by generic tools is insufficient.

    It ought to be possible to derive better models both from the formal restrictions:
    • each A-label must match ^[a-z\d]([a-z\d\-]{0,61}[a-z\d])?$; and
    • the entire domain including '.' label delimiters must not exceed 255 characters

    and from heuristics, including:
    • lower-order U-labels are often lexically, syntactically and semantically valid phrases in some natural language including proper nouns and numerals (but stripped of all whitespace and punctuation except hyphen, and folded per Nameprep), with a preference for shorter phrases;
    • higher-order labels are drawn from a dictionary of SLDs and TLDs and provide context for predicting which natural language is used in the lower-order labels; and
    • statistics of website popularities (such as from Alexa or Netcraft), potentially refined according to some environmental context.

    One thought is to build a Huffman coding of e.g. Alexa's list of the top 1m websites, together with an escape code for non-listed sites.

    The question remains how best to encode such non-listed sites. I currently imagine using the same training set to build Huffman codes of every TLD and many popular SLDs; then choose from a predefined set an adaptive natural language model and initial state for the lower order labels, indicating which combination has been used with another Huffman code (perhaps itself in the context of the TLD and/or SLD).
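    To make the first stage concrete, here is the sort of thing I have in mind (a toy Python sketch; the site names and weights are purely illustrative stand-ins for real popularity statistics):

    ```python
    import heapq
    from itertools import count

    def huffman_codes(weights):
        """Build a Huffman prefix code from a {symbol: weight} dict."""
        tie = count()   # tie-breaker so heapq never has to compare symbol groups
        heap = [(w, next(tie), (sym,)) for sym, w in weights.items()]
        heapq.heapify(heap)
        codes = {sym: "" for sym in weights}
        while len(heap) > 1:
            w1, _, group1 = heapq.heappop(heap)
            w2, _, group2 = heapq.heappop(heap)
            for sym in group1:
                codes[sym] = "0" + codes[sym]
            for sym in group2:
                codes[sym] = "1" + codes[sym]
            heapq.heappush(heap, (w1 + w2, next(tie), group1 + group2))
        return codes

    # Illustrative popularity weights; ESCAPE's weight stands for all unlisted sites.
    weights = {"google.com": 500, "facebook.com": 400, "bbc.co.uk": 90,
               "wikipedia.org": 80, "ESCAPE": 120}
    codes = huffman_codes(weights)
    # A listed domain is emitted as its code; an unlisted one as codes["ESCAPE"]
    # followed by the TLD/SLD and natural-language coding described above.
    ```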

    However, I am a little unsure what type(s) of adaptive natural language models would work best (perhaps after applying dictionary transforms): my research to date has led me to PPM and DMC, although I'm not even sure that a BWT-based approach wouldn't be better for some languages. I would welcome your thoughts.

    I would, of course, like to avoid reinventing the wheel if at all possible: so please do shout if existing tools might already meet my needs, or if I'm treading a long way from solid ground.
    Last edited by nickety; 20th October 2011 at 18:10.

  2. #2
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    Please clarify, do you want to compress a domain list?
    or do you want to produce compact codes for single domain names?
    (eg. to provide random access to the same list)

    Though either way, arithmetic coding would be much better here (than huffman).

  3. #3
    Member
    Join Date
    Oct 2011
    Location
    London, UK
    Posts
    10
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Shelwien View Post
    Please clarify, do you want to compress a domain list?
    or do you want to produce compact codes for single domain names?
    (eg. to provide random access to the same list)
    The latter: compact codes for single domain names.

    Though either way, arithmetic coding would be much better here (than huffman).
    I agree that arithmetic encoding should be used for the natural language parts (which involve multiple tokens), but surely there is no benefit over Huffman when encoding only a single token (such as which of all known TLDs applies)? Worse, because arithmetic coding requires one to additionally signal EOF, won't it lead to a less compact coding than Huffman?

  4. #4
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    > The latter: compact codes for single domain names.

    Ok. Then, as a first test, I'd start by "training" a CM (or PPM, but that would be more complex) model
    on some set of data, then use it to encode each domain name (without model updates, or with model backup, or with update tracking/undoing).
    It would also be useful to fix up some things in a rangecoder to improve precision.
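    A rough sketch of what I mean by "train, then code with a frozen model" (toy Python: an order-2 character model reporting ideal code lengths instead of driving a real rangecoder; the training list is made up):

    ```python
    import math
    from collections import Counter, defaultdict

    ORDER = 2       # context length in characters; a real CM would mix several orders
    SMOOTH = 1.0    # additive smoothing so unseen characters stay codable

    def train(names):
        """Count character frequencies per ORDER-character context (static after training)."""
        stats = defaultdict(Counter)
        for name in names:
            padded = "\x00" * ORDER + name + "\n"    # '\n' acts as the end-of-name symbol
            for i in range(ORDER, len(padded)):
                stats[padded[i - ORDER:i]][padded[i]] += 1
        return stats

    def estimated_bits(stats, name, alphabet_size=40):
        """Ideal code length (bits) of one name under the frozen model; a range coder
        driven by the same probabilities would land close to this."""
        bits = 0.0
        padded = "\x00" * ORDER + name + "\n"
        for i in range(ORDER, len(padded)):
            ctx, ch = padded[i - ORDER:i], padded[i]
            total = sum(stats[ctx].values()) + SMOOTH * alphabet_size
            bits -= math.log2((stats[ctx][ch] + SMOOTH) / total)
        return bits

    # Hypothetical training set; in practice this would be a large list of real domains.
    model = train(["example.com", "example.org", "encode.su", "wikipedia.org"])
    print(round(estimated_bits(model, "examples.org"), 1), "bits")
    ```

    (The '\n' terminator doubles as the end-of-name symbol, so no separate EOF machinery is needed.)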

    > Worse, because arithmetic coding requires one to additionally signal EOF,

    Technically there's no reason why you can't assign code intervals to whole strings,
    the same way as you do with huffman codes.
    Sure, Huffman codes are optimal bitcodes, but arithmetic codes won't be much
    more redundant, and they're likely easier to manage.

    > won't it lead to a less compact coding than Huffman?

    For whole strings a Huffman code can be more efficient (though only if you have stable statistics),
    but not for single symbols.
    And anyway the actual problem is about coding of new domains, right?..

  5. #5
    Member
    Join Date
    Oct 2011
    Location
    London, UK
    Posts
    10
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Thumbs up

    Quote Originally Posted by Shelwien View Post
    Ok. Then, as a first test, I'd start by "training" a CM (or PPM, but that would be more complex) model
    on some set of data, then use it to encode each domain name (without model updates, or with model backup, or with update tracking/undoing).
    It would also be useful to fix up some things in a rangecoder to improve precision.
    Great, thanks for these tips - but sadly most of it flew straight over my head!

    ***EDIT***
    If I understand correctly, you're suggesting that I train a context mixing compressor? Are there any ready-to-go tools out there that will do this?

    For whole strings a Huffman code can be more efficient (though only if you have stable statistics),
    but not for single symbols.
    And anyway the actual problem is about coding of new domains, right?..
    Yes, quite right. I'll pull back from Huffman!
    Last edited by nickety; 20th October 2011 at 19:56.

  6. #6
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    Well, you can start by testing with external compressors.
    Prepare pairs of files like "dictionary0.txt" and "dictionary1.txt", where the 2nd file just has one string (a domain name) appended.
    Then compress them both with any PPM/CM and look at the differences in file sizes.
    Basically, the compressed versions would also have a matching part at the start, same as the uncompressed ones.
    That should be enough to choose a compressor, parameters (like memory usage) and a dictionary.
    As to actual compressors to evaluate, I'd suggest
    http://ctxmodel.net/files/PPMd/ppmd_Jr1_sh8.rar
    http://mattmahoney.net/dc/#paq9a
    (note that paq compressors frequently write the filename into the archive, so the names need to have the same length)
    BWT won't work here, because it's blockwise and can't be trained.

    You can also try applying this filter before testing - http://xwrt.sourceforge.net/
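    Something like this would automate the measurement (Python, with the built-in LZMA purely as a placeholder so the script runs; for real numbers swap in ppmd/paq8 via subprocess, keeping the filenames the same length as noted above):

    ```python
    import lzma

    def compressed_size(data: bytes) -> int:
        # Placeholder codec; substitute an external ppmd/paq8 call for real results.
        return len(lzma.compress(data, preset=9))

    def marginal_cost(training: bytes, domain: str) -> int:
        """Extra compressed bytes caused by appending one domain (dictionary1 - dictionary0)."""
        return (compressed_size(training + domain.encode() + b"\n")
                - compressed_size(training))

    with open("dictionary0.txt", "rb") as f:       # the training list of domains
        training = f.read()
    for name in ("example.com", "some-new-startup.example"):
        print(name, marginal_cost(training, name), "bytes")
    ```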

  7. #7
    Member
    Join Date
    Feb 2010
    Location
    Nordic
    Posts
    200
    Thanks
    41
    Thanked 36 Times in 12 Posts
    I wish to very compactly encode the domain of each website a random Internet user visits, but find that the compression achieved by generic tools is insufficient.
    What are your goals? How many domains do you need to store, and in how many bytes each? Is this an offline or an online system? Are international punycoded domains allowed?

    I'm imagining PPM for the names, with special domain knowledge introducing common stems as single tokens, e.g. a name starting with www. and ending in .com could be symbol 256, and so on; furthermore, storing them in sorted order, since that further constrains the values.

    You can turn all TLDs and country codes into single symbols, and if you encode them before the name they can influence the model of the name itself, e.g. perhaps .jp domains have a little more Japanese character to their ASCII names than .se names.
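    Roughly what I mean for the TLD part (toy Python; the suffix table is just a stand-in for a real TLD/country-code dictionary built from training data):

    ```python
    # Stand-in suffix table; the real one would cover all TLDs/SLDs seen in training.
    SUFFIXES = {"com": 0, "org": 1, "co.uk": 2, "jp": 3, "se": 4}

    def split_domain(domain):
        """Return (suffix_symbol, remaining_name); the longest matching suffix wins."""
        labels = domain.lower().split(".")
        for n in range(len(labels) - 1, 0, -1):      # try the longest suffix first
            candidate = ".".join(labels[-n:])
            if candidate in SUFFIXES:
                return SUFFIXES[candidate], ".".join(labels[:-n])
        raise ValueError("suffix not in table: " + domain)

    sym, name = split_domain("guardian.co.uk")       # -> (2, 'guardian')
    # Emit sym first, then code `name` with a character model trained on names
    # under that suffix, so .jp statistics can differ from .se statistics.
    ```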
    Last edited by willvarfar; 20th October 2011 at 22:07.

  8. #8
    Member
    Join Date
    Oct 2011
    Location
    London, UK
    Posts
    10
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by willvarfar View Post
    What are your goals? How many domains do you need to store, and in how many bytes each?
    The goal is to represent the domain part of each website a "regular" user browses in no more than 5 bytes; based on back-of-envelope calculations, I think this ought not to be unreasonable.

    Is this an offline or an online system?
    Offline training is certainly possible (indeed, I wholly expect it to be employed), as are multiple passes over the data to be encoded; does this answer your question? If not, I'm afraid I'm not too sure...

    Are international punycoded domains allowed?
    IDNs are certainly allowed, but I imagine it will be easier to compress them in Unicode (not Punycode) form, where information patterns will be more readily evident and therefore the models somewhat simpler.

    I'm imagining PPM for the names, with special domain knowledge introducing common stems as single tokens, e.g. a name starting with www. and ending in .com could be symbol 256, and so on; furthermore, storing them in sorted order, since that further constrains the values.
    To be clear, I only need to encode the domain, not the hostname (so "www." etc. can readily be dropped): I anticipate using Mozilla's "public suffix list" to determine the domain of a given website.

    You can turn all TLDs and country codes into single symbols, and if you encode them before the name they can influence the model of the name itself, e.g. perhaps .jp domains have a little more Japanese character to their ASCII names than .se names.
    Great - this is exactly the sort of thing I had been thinking: glad to hear it wasn't a totally stupid idea!
    Last edited by nickety; 21st October 2011 at 10:59.

  9. #9
    Member
    Join Date
    Feb 2010
    Location
    Nordic
    Posts
    200
    Thanks
    41
    Thanked 36 Times in 12 Posts
    Can you communicate with a server?

    You can certainly simply store an integer ID for every seen domain name in a hashtable on a server, and simply ask that server to give you a list of the IDs. The server would allocate a new ID every time it encounters a new domain.

    This is a dictionary coder.

    8 bytes is enough for anyone that way!

    (With a simple length-prefix on the codes (or UTF-8 them) you would typically expect to be using much closer to 2 or 3 bytes per domain, especially if you primed the dictionary with the top n websites in order)
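    A sketch of that coder (Python; the primed list below is obviously just a placeholder for the real top-n ranking):

    ```python
    def encode_varint(n: int) -> bytes:
        """UTF-8-style code: 7 data bits per byte, top bit means 'more bytes follow'."""
        out = bytearray()
        while True:
            out.append((n & 0x7F) | (0x80 if n > 0x7F else 0))
            n >>= 7
            if not n:
                return bytes(out)

    class DictionaryCoder:
        def __init__(self, primed=()):
            self.ids = {}                          # domain -> id; low ids = popular sites
            for domain in primed:
                self.ids.setdefault(domain, len(self.ids))

        def encode(self, domain: str) -> bytes:
            if domain not in self.ids:             # server side: allocate on first sight
                self.ids[domain] = len(self.ids)
            return encode_varint(self.ids[domain])

    coder = DictionaryCoder(primed=["google.com", "facebook.com", "youtube.com"])
    print(len(coder.encode("google.com")))               # 1 byte
    print(len(coder.encode("brand-new-site.example")))   # 1-3 bytes, depending on table size
    ```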
    Last edited by willvarfar; 21st October 2011 at 09:13.

  10. #10
    Member
    Join Date
    Oct 2011
    Location
    London, UK
    Posts
    10
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by willvarfar View Post
    Can you communicate with a server?
    Sadly not.

    You can certainly simply store an integer ID for every seen domain name in a hashtable on a server, and simply ask that server to give you a list of the IDs. The server would allocate a new ID every time it encounters a new domain.

    This is a dictionary coder.
    Agreed; this is pretty much what I had envisaged doing with the top 1m sites, by distributing a dictionary with the encoder/decoder. I think my question is really about how I can best handle the exceptions.

    8 bytes is enough for anyone that way!

    (With a simple length-prefix on the codes (or UTF-8 them) you would typically expect to be using much closer to 2 or 3 bytes per domain, especially if you primed the dictionary with the top n websites in order)
    5 bytes (sorry, 8 was a typo before) is an absolute upper limit, but every bit is precious; so I think Huffman or arithmetic coding would offer better average compression than this?
    Last edited by nickety; 21st October 2011 at 10:54.

  11. #11
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    It certainly won't fit into 5 bytes - on average for some time period, maybe, but there's no way
    to describe all names within 5 bytes, compression or not.

    Even your Huffman code for a fixed dictionary likely won't fit, because it's a variable-length code by nature.

  12. #12
    Member
    Join Date
    Oct 2011
    Location
    London, UK
    Posts
    10
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Shelwien View Post
    It certainly won't fit into 5 bytes - on average for some time period, maybe, but there's no way
    to describe all names within 5 bytes, compression or not.
    I agree that there cannot be any way to describe every possible legal domain in 5 bytes. But many web sources put the total number of registered domains in the order of 10^8, and 5 bytes is far more than enough to simply enumerate those; so the entropy of the space actually in use must be quite low, and armed with a sufficiently accurate model of it, 5 bytes could work?

    Do you think that 5 bytes is unreasonable if I encode the data along the proposed lines (i.e. TLD/SLD by dictionary, followed by a language identifier and then a very short phrase - usually a few words at most - using some context model specific to that language)?
    Last edited by nickety; 21st October 2011 at 14:01.

  13. #13
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    To make it clearer, let's replace the domain names with 8-byte random numbers.
    Then, if you simply have a table of 10^8 such numbers, even 27 bits would be enough to enumerate them.
    But to encode a new number, you'd still need 8 bytes.

    In other words, your regexp (or any other reasonable rules) won't restrict the choice of domain names
    to 256^5 combinations.

    Also, compression methods assign shorter codes to more probable strings, and longer codes to less probable ones,
    so plain enumeration is much better if you need fixed-size codes.
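    To put numbers on the first point (using the 10^8 figure you quoted):

    ```python
    import math
    # Enumerating ~10^8 names that are already in a table:
    print(math.ceil(math.log2(1e8)))   # 27 bits, comfortably inside 5 bytes
    # A name that is NOT in the table costs whatever the model assigns to it,
    # independent of how small the table is - that's where the 5 bytes break down.
    ```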

  14. #14
    Member
    Join Date
    Oct 2011
    Location
    London, UK
    Posts
    10
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Shelwien View Post
    To make it clearer, let's replace the domain names with 8-byte random numbers.
    Then, if you simply have a table of 10^8 such numbers, even 27 bits would be enough to enumerate them.
    But to encode a new number, you'd still need 8 bytes.

    In other words, your regexp (or any other reasonable rules) won't restrict the choice of domain names
    to 256^5 combinations.
    Understood. But I'm not trying to restrict the choice of "every possible domain name" to 256^5 combinations. I'm trying to restrict the choice of websites that a "regular" web user might visit to 256^5 combinations. Using your random-number analogy, I would argue that the 8-byte numbers are not at all random but rather exhibit some very predictable patterns: once one considers the decisions taken by website operators in selecting a domain name (i.e. something short, memorable and meaningful in their chosen language) and additionally refines the set according to known statistics about the domains of sites that are popular, 256^5 starts looking very reasonable indeed.

    Also, compression methods assign shorter codes to more probable strings, and longer codes to less probable ones,
    so plain enumeration is much better if you need fixed-size codes.
    Perhaps there has been a slight misunderstanding: I do not need fixed-size codes. But I wish to express the domain of any website a "regular" web user happens to visit in no more than 5 bytes. I can tolerate that the domain of an exceptionally rare site might require more space to encode, but I would still like the expected length of the encoded form to be minimised.

    My working assumption is that this can be achieved by modelling probabilities of (TLD,SLD,language,etc) combinations from an initial training set. But where I am coming most unstuck is in deciding on appropriate models for the natural language elements: per my opening post, my initial research led me to PPM and DMC and I'm continuing to play with them. But I'm not sure whether there is a better class of natural language modelling at which I should look (perhaps different ones are more appropriate for different natural languages), or whether I'm miles off track altogether.
    Last edited by nickety; 21st October 2011 at 17:11.

  15. #15
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    > I'm trying to restrict the choice of websites that a "regular" web
    > user might visit to 256^5 combinations.

    With a large database of known names, it may be possible to keep
    the empirical average within 5 bytes.
    But there won't be any kind of guarantee for that.
    For example, it would certainly be possible to intentionally
    push that average up to 100 bytes, by supplying specially
    prepared input data.

    > I would argue that the 8-byte numbers are not at all random but
    > rather exhibit some very predictable patterns:

    my 8-byte random numbers are equivalents of your up-to-255-byte strings.
    Patterns or not, you're not likely to reach better than 1bpc (bit per symbol)
    compression rate for text-like data.

    > 256^5 starts looking very reasonable indeed.

    Sorry, but it doesn't to me. At all.

    > But where I am coming most unstuck is in deciding on appropriate
    > models for the natural language elements: per my opening post, my
    > initial research led me to PPM and DMC and I'm continuing to play
    > with them.

    Forget DMC - there're no efficient implementations.
    As to PPM, you only need to consider ppmd - there're no better open-source
    implementations and you're not likely to write one within reasonable time.
    Also PPM is relatively complex and imprecise, which makes it not the best
    choice for efficient coding of short strings.

    > But I'm not sure whether there is a better class of natural language
    > modelling at which I should look (perhaps different ones are more
    > appropriate for different natural languages), or whether I'm miles
    > off track altogether.

    Unfortunately I don't have enough sample data, or I'd just test it
    myself and post some real numbers.
    But I already explained in a previous post how you can evaluate
    existing compressors - you're likely to reach the best results
    with paq8px or maybe xwrt+paq8px (although it's likely not the
    best choice for real use).

    http://mattmahoney.net/dc/text.html

  16. #16
    Member
    Join Date
    Oct 2011
    Location
    London, UK
    Posts
    10
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Shelwien View Post
    my 8-byte random numbers are equivalents of your up-to-255-byte strings.
    Patterns or not, you're not likely to reach better than 1bpc (bit per symbol)
    compression rate for text-like data.
    But they're not up-to-255-byte strings. Besides the "effective TLD" (which is limited to a known dictionary), we only need to compress one DNS label - that's a limit of 63 bytes. Furthermore, the character set is more limited than ordinary text (there are no capitals, punctuation or whitespace), so one ought to be able to improve on 1bpc - but even if not, the longest possible such label would still compress to under 8 bytes.

    In reality few (if any) of the domains we're discussing (i.e. those of websites visited by normal users in the real world) will actually be 63 characters long: website operators want much shorter names because they're more memorable and easier to type without error. 20 characters is a lot. But let's say we have a worst case of 40 characters, still compressing at our rather poor 1bpc of punctuated capitalised text: that still gives us a 5 byte compressed result.
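    Spelling out that back-of-envelope sum (the 1bpc rate is an assumption, not a measurement):

    ```python
    bpc = 1.0                                   # assumed bits per character for the label
    for length in (20, 40, 63):
        print(length, "chars ->", length * bpc / 8, "bytes before the TLD/dictionary part")
    # 20 -> 2.5, 40 -> 5.0, 63 -> 7.875 bytes
    ```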

    But I already explained in a previous post how you can evaluate
    existing compressors - you're likely to reach the best results
    with paq8px or maybe xwrt+paq8px (although it's likely not the
    best choice for real use).

    http://mattmahoney.net/dc/text.html
    Thank you again for those suggestions - I have taken them on board and am experimenting.

  17. #17
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    Unfortunately even a random string of symbols from [0-9a-z] of length 0..63 would take 42 bytes to store.
    See http://www.wolframalpha.com/input/?i....%2C37%5E63%5D
    Also http://www.wolframalpha.com/input/?i...x%5D%3D%3D5%5D
    (that's the max length that would fit into 5 bytes)
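    For reference, the same sums worked out directly (taking A = 37 symbols, i.e. [0-9a-z] plus the hyphen, which is what the linked calculation uses):

    ```python
    import math
    A = 37                                             # a-z, 0-9 and '-'

    def bits_to_enumerate(max_len):
        """Bits needed to enumerate all strings of length 0..max_len over A symbols."""
        return math.log2(sum(A ** L for L in range(max_len + 1)))

    print(math.ceil(bits_to_enumerate(63) / 8))                      # 42 bytes
    print(max(n for n in range(64) if bits_to_enumerate(n) <= 40))   # 7 chars fit in 5 bytes
    ```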

    And as I said, the problem is not the expected average compressed size,
    but your expected size for new names.
    Of course I won't know for sure until we really test it, but I'd expect
    something closer to 10 bytes than 5.

  18. #18
    Member
    Join Date
    Oct 2011
    Location
    London, UK
    Posts
    10
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Shelwien View Post
    And as I said, the problem is not the expected average compressed size,
    but your expected size for new names.
    I think our views only differ on whether new names are predictable. I argue that they are, as they follow the pattern of being short, memorable and meaningful in some natural language. Unless I've misunderstood you, it appears your view is that new names are random?

  19. #19
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,134
    Thanks
    179
    Thanked 921 Times in 469 Posts
    > it appears your view is that new names are random?

    They're not, but we can't exclude any of the random ones either.

    I'm only saying that imho the average codelength for new domain names (not in the database)
    would be longer than 5 bytes; it can also potentially be much longer (100 bytes, maybe).

    But we can wait and see the results of some actual experiments with dictionary-based coding.

  20. #20
    Member
    Join Date
    Feb 2010
    Location
    Nordic
    Posts
    200
    Thanks
    41
    Thanked 36 Times in 12 Posts
    Do you need random access to the names? I.e. must each name start on a bit boundary? Or some other aligned boundary?

  21. #21
    Member
    Join Date
    Oct 2011
    Location
    London, UK
    Posts
    10
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by willvarfar View Post
    Do you need random access to the names? I.e. must each name start on a bit boundary? Or some other aligned boundary?
    Each name needs to be encoded entirely separately and independently: there is no intention to compile them into a list.
