
Thread: Linux memory manager discussion, same problem as compression

  1. #1
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts

    Linux memory manager discussion, same problem as compression

    An article complaining about Linux's page cache eviction policy came up on HackerNews, and I noticed that it's solving the same problem that occurs in compression: predicting the next element of a sequence based on prior examples:

    https://news.ycombinator.com/item?id=6809251


    The page cache size is fixed, so it acts like a fixed-size dictionary that tries to capture and intercept the greatest number of future page requests to prevent them from going all the way out to disk. The page addresses are like phrases. The question is, what phrases should be kept in the dictionary? The Linux memory manager allocates 50% of the page cache to the most recently requested pages. That part works like LZ's sliding window. It allocates the other 50% to pages that have been accessed more than once over a longer time period. That would be like a dictionary that gives preference to phrases that have occurred more than once, with slow replacement.
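
    As a rough sketch of that two-list arrangement (a toy Python model, not the actual kernel code; the 50/50 split and the promote-on-second-touch rule are the only parts taken from the description above):

    Code:
    from collections import OrderedDict

    class TwoListCache:
        """Toy split page cache: half the frames hold recently requested pages
        (inactive list), half hold pages requested more than once (active list)."""

        def __init__(self, frames):
            self.cap = frames // 2          # arbitrary 50/50 split between the lists
            self.inactive = OrderedDict()   # page -> None, kept in LRU order
            self.active = OrderedDict()

        def access(self, page):
            if page in self.active:                   # already promoted: refresh recency
                self.active.move_to_end(page)
                return True                           # hit
            if page in self.inactive:                 # second touch: promote
                del self.inactive[page]
                self.active[page] = None
                if len(self.active) > self.cap:       # (a real kernel would demote the
                    self.active.popitem(last=False)   #  victim, not just drop it)
                return True                           # hit
            self.inactive[page] = None                # first touch: inactive list
            if len(self.inactive) > self.cap:
                self.inactive.popitem(last=False)     # evict least recently used
            return False                              # miss

    Run a long one-time scan through it and only the inactive half churns; pages that earned their way onto the active list survive, which is the behavior the split is meant to buy.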

    Streams of requests follow predictable patterns. Some processes request pages in straight sequential order, and may make subsequent requests for the entire sequence starting from the beginning. Other processes request pages all over the address space, but commonly re-request pages that have been requested recently, whether by themselves or another process. The system-wide pattern of requests would tend to resemble that found in compressible data: you could gather statistics that show a Zipf distribution.

    I think experience from compression would be valuable here. I wonder if there are request traces available that are suitable for research. An algorithm that yielded good compression on a representative sample of traces would be an important precursor to a solution that would work in an OS memory manager.

  2. #2
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    Forecasting the future is a distinctive feature of all living creatures; it's what distinguishes us from dead matter.

  3. #3
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    What makes us distinct from other creatures is that we do it with algorithms.
    Last edited by nburns; 29th November 2013 at 02:32.

  4. #4
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts
    Actually, while the eviction problem is related to the conventional compression problem, it's also hugely different in a very basic way---you don't need or really want precise, detailed predictions in anywhere near the same sense as for compression.

    Consider the OPT replacement algorithm (a.k.a. Belady's MIN), which many people mistakenly think of as THE optimal (demand paging) replacement algorithm.

    OPT evicts the specific page that won't be used for the very longest time, but that specificity isn't needed in real-world replacement policies, and isn't a practical goal. Its main use is that it enables efficient simulation of an optimal replacement policy for all memory sizes in a single pass through a trace, in n log n time.
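
    For concreteness, here's a minimal offline simulation of Belady's MIN for one fixed memory size, in Python (the naive form, not the single-pass, all-memory-sizes n log n simulation mentioned above):

    Code:
    def belady_misses(trace, frames):
        """Offline OPT/MIN for one fixed memory size: on a miss with memory
        full, evict the resident page whose next use is farthest away."""
        # Precompute, for each position i, the next position where trace[i] recurs.
        next_use = [0] * len(trace)
        seen_at = {}
        for i in range(len(trace) - 1, -1, -1):
            next_use[i] = seen_at.get(trace[i], float('inf'))
            seen_at[trace[i]] = i

        resident = {}          # page -> position of its next use
        misses = 0
        for i, page in enumerate(trace):
            if page not in resident:
                misses += 1
                if len(resident) >= frames:
                    victim = max(resident, key=resident.get)  # farthest (or never) reused
                    del resident[victim]
            resident[page] = next_use[i]   # refresh on hits as well as misses
        return misses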

    For a given memory size on an actual machine, there are a LOT of optimal or very nearly optimal replacement policies, most of them very different from OPT. You just evict anything that won't be touched for a long time, where "a long time" is relative to the actual memory size. Any page that won't be touched before you've touched a memory-full of other pages is an optimal candidate for eviction.

    What you want to do is to divide the in-memory pages into those that will probably be accessed soon and those that won't, and evict anything that won't to make room for something that will. That's all you need to do, and it's usually easier to approximate than OPT.

    (Some of my papers refer to this as "timescale relativity"; you might want to google up EELRU for a good example.)

    The same principle applies to things like prefetching, which is very often mistakenly analogized to LZish compression.

    In general, you don't need or particularly want to predict exactly which page will be touched next, for several reasons. One is that by the time you can predict exactly which page will be touched next, like predicting a literal string, it's too late, and if you have to do a seek anyhow, it doesn't improve performance. (It's just not worth the overhead to predict a memory access a tiny fraction of a seek time ahead.)

    What you generally need to do is to divide the pages NOT in memory into those that will (likely) be touched soon and those that (likely) won't, and do sequential prefetching for those that will, but inhibit sequential prefetching for those that won't. You use the prediction to determine whether to grab the next page(s) on disk, and how long to cache them if they go untouched for a while, rather than to determine which page to grab.

  5. #5
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by Paul W. View Post
    OPT evicts the specific page that won't be used for the very longest time, but that specificity isn't needed in real-world replacement policies, and isn't a practical goal. Its main use is that it enables efficient simulation of an optimal replacement policy for all memory sizes in a single pass through a trace, in n log n time.
    That is a "clairvoyant" algorithm that isn't actually possible in reality. It just gives a lower bound. I think the compression equivalent would be an impossible encoding that uses future information in its model, i.e. information that's unknown to the decoder. Actually, any time you calculate Shannon entropy without accounting for transmitting the model itself, i.e. like static Huffman minus the tree, it would be pretty much the same concept as OPT.

    The same principle applies to things like prefetching, which is very often mistakenly analogized to LZish compression.
    The problem is fundamentally the same. It just has somewhat different constraints, and the cost of a request is much less fine-grained than in compression: you might only count whether the request was in cache or not. In compression it typically matters not only whether you get to use the dictionary; the reference itself also has a cost in bits. So prefetching might not be able to be as sophisticated as LZ; it might have to make coarser-grained predictions that aren't updated on each reference. It's interesting, though, that apparently CPUs use perceptrons to predict branches, which is, again, fundamentally the same problem as compression.
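
    The perceptron predictors in the literature are, in outline, about this simple. The sketch below is a single global-history perceptron in Python; real designs keep a table of perceptrons indexed by a hash of the branch address and use saturating weights, and the history length and training threshold here are just placeholder values:

    Code:
    class PerceptronPredictor:
        """One perceptron predicting taken/not-taken from a global history of
        recent branch outcomes (+1 = taken, -1 = not taken)."""

        def __init__(self, history_len=16, threshold=32):
            self.history = [1] * history_len          # most recent outcomes
            self.weights = [0] * (history_len + 1)    # weights[0] is the bias
            self.threshold = threshold                # train only when unsure or wrong

        def predict(self):
            y = self.weights[0] + sum(w * x for w, x in zip(self.weights[1:], self.history))
            return y >= 0, y                          # (taken?, raw confidence)

        def record(self, taken):
            guess, y = self.predict()
            t = 1 if taken else -1
            if guess != taken or abs(y) <= self.threshold:
                self.weights[0] += t                  # adjust bias toward the outcome
                for i, x in enumerate(self.history):
                    self.weights[i + 1] += t * x      # strengthen correlated history bits
            self.history = [t] + self.history[:-1]    # shift in the new outcome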

    Notice that compression usually works by shortening references to frequently-requested data. If the reference is some finite bit sequence, then that could be interpreted as a path through a tree to access the reference. So minimizing the lengths of the codes is equivalent to minimizing the number of tree nodes that have to be looked at to serve that request. In fact, there is a natural correspondence between being able to service requests efficiently and being able to encode requests efficiently. Any solution for one problem can be generalized to solve either (although a solution is usually better suited to one or the other).
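
    A tiny illustration of that correspondence, with a hypothetical prefix code in Python: serving a symbol through a prefix-code trie costs one node visit per code bit, so the expected lookup cost and the expected code length are literally the same number.

    Code:
    def build_trie(codes):
        """codes: dict mapping symbol -> bit string of a prefix-free code."""
        root = {}
        for sym, bits in codes.items():
            node = root
            for b in bits:
                node = node.setdefault(b, {})
            node['sym'] = sym
        return root

    def lookup(root, bits):
        """Walk one trie node per code bit; return the symbol and the node count."""
        node = root
        for steps, b in enumerate(bits, start=1):
            node = node[b]
        return node['sym'], steps

    codes = {'a': '0', 'b': '10', 'c': '110', 'd': '111'}     # hypothetical code
    freq  = {'a': 0.5, 'b': 0.25, 'c': 0.15, 'd': 0.10}
    trie  = build_trie(codes)
    mean_bits  = sum(freq[s] * len(codes[s]) for s in codes)
    mean_steps = sum(freq[s] * lookup(trie, codes[s])[1] for s in codes)
    assert mean_bits == mean_steps      # encoding cost == service cost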

    In general, you don't need or particularly want to predict exactly which page will be touched next, for several reasons. One is that by the time you can predict exactly which page will be touched next, like predicting a literal string, it's too late, and if you have to do a seek anyhow, it doesn't improve performance. (It's just not worth the overhead to predict a memory access a tiny fraction of a seek time ahead.)
    If you could make predictions about which pages are likely to be re-referenced, you could potentially use this to do prefetching or to decide which pages to evict.
    Last edited by nburns; 7th December 2013 at 01:56.

  6. #6
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts
    "That is a "clairvoyant" algorithm that isn't actually possible."

    Right. I should have made that explicit.

    OPT is one of the two major baselines that real replacement policies are often compared to, just for reference, with the other being LRU.
    Nobody can implement OPT as an online algorithm, and LRU is well understood and works pretty well---it is within a small constant factor of OPT in terms of the memory required to achieve a certain hit rate. (If you give it 2x the memory you give OPT, it will do no worse, which makes it clear that we're talking about a factor of two savings in memory AT MOST for any replacement policy over LRU. There's not a similar bound on the difference in miss rates for a given fixed size of memory---LRU can do orders of magnitude worse than OPT for a given size memory. E.g., if you have M pages of memory and a loop that just touches each of M+1 pages once per iteration, it will always miss, because LRU will always stupidly evict exactly the page that's about to be touched again.)
    A realistic, implementable policy can approximate LRU in cases where it's very good, and diverge strikingly from it in cases where LRU is bad---e.g., somewhat too-big loops---and that's a practical way to keep from doing a lot worse than OPT in most cases. (That's what EELRU does. It acts sorta like MRU for large loops, but still keeps very recently touched data in memory, because there are usually multiple things going on, such that LRU is good at some scales while MRU is good at others. Most things with big loops are doing other stuff as they iterate through the loop, and LRU tends to work well for that stuff.)
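
    The too-big-loop pathology is easy to reproduce. Here's a quick Python check of plain LRU on a loop of M+1 distinct pages with M frames, next to an evict-most-recently-used policy that handles the same loop almost perfectly:

    Code:
    from collections import OrderedDict

    def simulate(trace, frames, evict_newest=False):
        """Count misses for LRU (default) or for evict-most-recently-used."""
        cache, misses = OrderedDict(), 0
        for p in trace:
            if p in cache:
                cache.move_to_end(p)                  # refresh recency on a hit
            else:
                misses += 1
                if len(cache) >= frames:
                    cache.popitem(last=evict_newest)  # False = LRU victim, True = MRU victim
                cache[p] = None
        return misses

    M = 100
    trace = list(range(M + 1)) * 10                   # ten passes over M+1 pages
    print(simulate(trace, M))                         # LRU: 1010 misses -- every access misses
    print(simulate(trace, M, evict_newest=True))      # MRU-ish: ~1 miss per pass after the first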

    "The problem is fundamentally the same. It just has somewhat different constraints, and the cost of a request is much less fine-grained than in compression: you might only count whether the request was in cache or not. In compression it typically matters not only if you get to use the dictionary, but the reference itself also has a cost in bits."

    Let me explain the big difference I'm talking about. When people analogize replacement or prefetching policies to LZ- or explicitly Markov-ish compression, they tend to see a sequence of page touches as a string of symbols, and think that it is the same kind of prediction problem---given a sequence of symbols, predict the next symbol or the next n symbols in the sequence.

    It's not that kind of problem. You don't generally care at all about the actual ordering of those symbols in that string, except in terms of whether you can predict which symbols will be in the string "pretty soon," in any order.

    So for example, I may iterate through an array forwards, then do something else with other data structures, and then iterate through that array forward or backwards in the next phase where I use it. (I may do that to improve locality, if either order will work, or because it's natural for the computation, if it's implementing a big stack.) The first batch of references to the array may share no common substrings with the second batch, but still convey a lot of exploitable locality information---once I start touching the array again, I'm going to touch the whole thing, so aggressive prefetching will work just great.

    Another common example is a hash table. I may hash a bunch of stuff, and touch most or all of the hash table I'm scattering the keys in, in a pseudo-random order. Then I may hash a different bunch of stuff, and again touch all or most of the hash table, in a different pseudo-random order. Or I might step through the hash table in memory order one time (say, to initialize it), then later bang on it pseudorandomly now and then in later phases, each time touching it in a different pseudo-random order.

    Seeing that as a string matching or Markov-type problem is just a mistake. It's the wrong kind of model, because it isn't about literal repeats of substrings or probabilities of next symbols based on the most recent substring. A Markov model of any order just can't capture that very common, simple, and robust kind of regularity that is crucial for replacement and prefetching policies.

    Queue-like, stack-like, and hash-table-like access patterns are all common in real programs, as are combinations where you treat something one way at one time, and another the next times. (Especially when you take memory reuse by allocators and garbage collectors' access & reallocation patterns into account.)

    Viewing the reference string as a string to be compressed by looking for common subsequences misses the point. What you're looking for is not common subsequences at all. You're looking for sets of things that are used together, and sets of things that are not used together, or sometimes used together and sometimes not. (That's crucial if you're doing clustering of related stuff into pages, such that simple paging and sequential prefetching work well---you want to separate things that are sometimes used differently, so that they don't have to all be in memory if any of them are.)

  7. #7
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Ok. I accept that trying to prefetch requests as they are coming in may be too difficult. But there are many, many other ways to do compression.

    The complaint in the link I posted originally had to do with Linux doing a suboptimal job of caching long sequential reads. It was caching one and completely evicting the other.

    If the page cache recognized those as sequential reads, then it could treat them as atomic and apply special logic in deciding how to cache them. So, in that case, it wouldn't simply be trying to predict one request ahead. It would be trying to break past requests into meaningful chunks ahead of time, i.e. like a compressor that maintains a semi-static dictionary off-line. The goal would be to make better decisions on which pages to evict, rather than do prefetching after the pages have already been evicted.
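
    The detection half of that is not much code. A Python sketch (the run-length threshold and the per-file bookkeeping are made up for illustration): remember the last block offset seen for each file, and once accesses have been contiguous for long enough, tag the rest of the run as a stream so the eviction side can handle the whole run as one unit, e.g. by capping how much of the cache any one stream may occupy.

    Code:
    class SequentialDetector:
        """Classify block accesses as part of a long sequential run ('stream')
        or as 'random'. The threshold is arbitrary; the policy that consumes
        these labels is the interesting part and is not shown here."""

        RUN_THRESHOLD = 32                    # contiguous blocks before we call it a stream

        def __init__(self):
            self.last_block = {}              # file_id -> last block offset accessed
            self.run_len = {}                 # file_id -> current sequential run length

        def classify(self, file_id, block):
            if self.last_block.get(file_id) == block - 1:
                self.run_len[file_id] = self.run_len.get(file_id, 1) + 1
            else:
                self.run_len[file_id] = 1     # run broken: start counting again
            self.last_block[file_id] = block
            return 'stream' if self.run_len[file_id] >= self.RUN_THRESHOLD else 'random'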

    This kind of logic might not fit into LZ, but it would not be out-of-place in ZPAQ.
    Last edited by nburns; 7th December 2013 at 03:00.

  8. #8
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts
    Ok. I accept that trying to prefetch requests as they are coming in may be too difficult. But there are many, many other ways to do compression.
    Yes, that's why I was talking specifically about LZ-ish or straightforward Markov-ish compression---it implicitly uses the wrong kind of model, but that's what most people think of when they make the analogy.

    My WKdm wordwise compression algorithm's (implicit) model is more like what you want for caching or prefetching. It only assumes that stuff you've seen very recently will likely be seen again very soon (maybe with different low bits), and pays no attention to sequences across word boundaries. (Works great for a lot of in-memory data representations, not so well for typical text files, where you do have a lot of longish literal string repetitions it doesn't notice.)
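
    For anyone curious what that implicit model amounts to, the tagging step looks roughly like this (a simplified Python sketch of a WK-style compressor, not the real WKdm layout or bit packing; the 16-entry dictionary and the 22/10-bit partial-match split are from my recollection of the published description, so treat the details as illustrative):

    Code:
    ZERO, EXACT, PARTIAL, MISS = range(4)        # 2-bit tags

    def wk_model(words, dict_size=16):
        """Tag each 32-bit word against a small recency dictionary: all-zero,
        exact repeat, high-bits-only repeat (low 10 bits differ), or miss.
        Returns the tag stream and a rough compressed size in bits."""
        dictionary = [0] * dict_size
        out, bits = [], 0
        for w in words:
            slot = (w >> 10) % dict_size         # index by the high bits (simplified hash)
            if w == 0:
                out.append((ZERO,))
                bits += 2
            elif dictionary[slot] == w:
                out.append((EXACT, slot))
                bits += 2 + 4                    # tag + dictionary index
            elif dictionary[slot] >> 10 == w >> 10:
                out.append((PARTIAL, slot, w & 0x3FF))
                bits += 2 + 4 + 10               # tag + index + low bits
                dictionary[slot] = w
            else:
                out.append((MISS, w))
                bits += 2 + 32                   # tag + full literal word
                dictionary[slot] = w
        return out, bits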

    The complaint in the link I posted originally had to do with Linux doing a suboptimal job of caching long sequential reads. It was caching one and completely evicting the other.
    Yes, that's a really bad idea.
    If the page cache recognized those as sequential reads, then it could treat them as atomic and apply special logic in deciding how to cache them. So, in that case, it wouldn't simply be trying to predict one request ahead. It would be trying to break past requests into meaningful chunks ahead of time, i.e. like a compressor that maintains a semi-static dictionary off-line. The goal would be to make better decisions on which pages to evict, rather than do prefetching after the pages have already been evicted.
    Right, if I understand you correctly.

    What I think you usually want to do is prefetch fairly aggressively---it doesn't cost much more to grab a bunch of blocks than to grab one---and then evict things judiciously, paying due attention to the relevant timescale of your cache. A prefetch is generally a good one if you come back and touch the prefetched block any time before you touch a whole bunch of other blocks, from any number of other files, where "a whole bunch" is a considerable fraction of your cache size.

    It's also usually a bad thing to think in terms of sequential front-to-back reads and optimize that as a very special case. You usually want to think in terms of basic spatial locality---if you're repeatedly fetching things that are near each other, something's wrong, and you need to increase your fetch size, and it usually doesn't matter whether you prefetch ahead of where you're fetching, or just fetch larger hunks of data around where you're fetching. (E.g., 16K or 64K aligned chunks instead of 4K blocks.)
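
    That chunk-aligned approach is about as simple as fetch logic gets. A Python sketch, using the 4K blocks and 64K chunks from the example above:

    Code:
    BLOCK = 4 * 1024
    CHUNK = 64 * 1024                              # fetch granularity: 16 blocks per chunk

    def blocks_to_fetch(fault_offset, cached_blocks):
        """On a miss, fetch every not-yet-cached block in the aligned chunk
        around the faulting offset -- no forward/backward direction guessing."""
        chunk_start = (fault_offset // CHUNK) * CHUNK
        first = chunk_start // BLOCK
        return [b for b in range(first, first + CHUNK // BLOCK)
                if b not in cached_blocks]

    Forward scans, backward scans, and random probes into the file all end up pulling in a whole chunk per seek instead of a single block, which is the behavior described below.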

    In the case of strictly sequential front-to-back reads, doing that will work just fine. Your first big fetch may pull in some data behind the point you're reading when what you actually need is what's ahead, if that's how the big chunk alignment works out, but all the subsequent sequential reads will fault on the first block of a big chunk and fetch all the rest too.

    If you do it that simple way, you handle other cases of spatial locality as well, not just the sequential case. So if you're going backwards through a file, it does the same basic thing, and works just fine instead of falling on its face.

    And if you're randomly reading the file (e.g., because it holds a hashed database index), it still works. If the whole file fits easily in your cache, you'll aggressively fetch the whole file, which is probably the right thing to do if you keep fetching random blocks of it. As long as you evict stuff in a timescale-relative way and don't let that pollute your whole cache when it's not working out, you'll do fine.

    Ideally, how long you hold on to prefetched blocks vs. normally cached (touched) blocks should depend on the kind of simple online cost-benefit analysis we do for simple replacement in EELRU, and for adjusting the compression cache size in compressed memory caching.

    You essentially have two basic predictors in your model, one of which says that you're likely to touch things soon if you touched them already lately, and another that says that you're likely to touch things soon if you touched things spatially near them lately.

    You can tell which of those predictors is working better, on the fly, by noticing how often you touch prefetched pages vs. how often you touch pages you touched already but that have gone untouched for a long time. E.g., if you're fetching a lot of pages you could have prefetched but didn't, or did prefetch but discarded before you touched them, you'd benefit from using more space for prefetches; if you're touching a lot of pages you just evicted, you'd benefit from more space for those pages. More generally, you can keep running histograms of how much it pays off to hold on to prefetched pages for how long, and how much it pays off to keep actually touched pages for how long, and do whatever has been working better lately. (The model histograms are actually independent of what's really in memory or got evicted---they're modeling the programs' behavior directly, rather than what you really did or didn't prefetch or evict.) That will tell you directly how big a fraction of your cache it's worth using for prefetches lately.
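
    In miniature, that on-line cost-benefit adjustment can look like the sketch below (Python). It's only the adaptation skeleton: two ghost lists of recently evicted pages stand in for the model histograms, and a miss that lands in either one nudges the split between prefetch space and recency space, ARC-style. It's in the spirit of what's described, not EELRU or any real kernel policy.

    Code:
    from collections import OrderedDict

    class AdaptiveSplit:
        """Track two ghost lists: prefetched pages evicted unused, and touched
        pages evicted after going cold. A miss that lands in either ghost list
        means that policy would have scored a hit, so shift space toward it."""

        def __init__(self, frames, ghost_cap=1024):
            self.frames = frames
            self.prefetch_share = 0.25            # fraction of frames for prefetched pages
            self.ghost_prefetch = OrderedDict()   # evicted-before-use prefetches
            self.ghost_touched = OrderedDict()    # evicted previously-touched pages
            self.ghost_cap = ghost_cap

        def on_miss(self, page):
            step = 1.0 / self.frames
            if page in self.ghost_prefetch:       # more prefetch space would have helped
                self.prefetch_share = min(0.75, self.prefetch_share + step)
                del self.ghost_prefetch[page]
            elif page in self.ghost_touched:      # more recency space would have helped
                self.prefetch_share = max(0.05, self.prefetch_share - step)
                del self.ghost_touched[page]

        def on_evict(self, page, prefetched_and_unused):
            ghost = self.ghost_prefetch if prefetched_and_unused else self.ghost_touched
            ghost[page] = None
            if len(ghost) > self.ghost_cap:
                ghost.popitem(last=False)         # bound the ghost history

        def prefetch_budget(self):
            return int(self.prefetch_share * self.frames)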

    This kind of logic might not fit into LZ, but it would not be out-of-place in ZPAQ.
    I've been meaning to learn more about ZPAQ and how easy it is to code some "weird" models I want to test out. Soon-ish, I hope.

  9. #9
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by Paul W. View Post
    It's also usually a bad thing to think in terms of sequential front-to-back reads and optimize that as a very special case. You usually want to think in terms of basic spatial locality---if you're repeatedly fetching things that are near each other, something's wrong, and you need to increase your fetch size, and it usually doesn't matter whether you prefetch ahead of where you're fetching, or just fetch larger hunks of data around where you're fetching. (E.g., 16K or 64K aligned chunks instead of 4K blocks.)

    In the case of strictly sequential front-to-back reads, doing that will work just fine. Your first big fetch may fetch some stuff behind where you fetch, when what you need is what's ahead, if that's how the big chunk alignment works out, but all the subsequent sequential reads will fault on the first block of a big chunk and fetch all the rest too.

    If you do it that simple way, you handle other cases of spatial locality as well, not just the sequential case. So if you're going backwards through a file, it does the same basic thing, and works just fine instead of falling on its face.
    You should go back and look at the HN post. It's a particularly difficult case. Sequential reads in general defeat LRU, because they go the opposite way that LRU expects. If the sequence is longer than the buffer cache, a second iteration through the sequence will always be reading pages after they've been evicted.

    You can tell which of those predictors is working better, on the fly, by noticing how often you touch prefetched pages vs. how often you touch pages you touched already but that have gone untouched for a long time. E.g., if you're fetching a lot of pages you could have prefetched but didn't, or did prefetch but discarded before you touched them, you'd benefit from using more space for prefetches; if you're touching a lot of pages you just evicted, you'd benefit from more space for those pages. More generally, you can keep running histograms of how much it pays off to hold on to prefetched pages for how long, and how much it pays off to keep actually touched pages for how long, and do whatever has been working better lately. (The model histograms are actually independent of what's really in memory or got evicted---they're modeling the programs' behavior directly, rather than what you really did or didn't prefetch or evict.) That will tell you directly how big a fraction of your cache it's worth using for prefetches lately.
    Yeah, there are many potential ways to design eviction/prefetch policies. You can try out ideas on traces to validate them. My proposal is that taking traces and trying to compress them would give you -- if not an exact solution, at least some useful information to start from.

    I've been meaning to learn more about ZPAQ and how easy it is to code some "weird" models I want to test out. Soon-ish, I hope.
    I haven't learned how to use it, I admit. But I sort of figured out what it's for, and that's a start.

  10. #10
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts
    nburns:

    You should go back and look at the HN post. It's a particularly difficult case. Sequential reads in general defeat LRU, because they go the opposite way that LRU expects. If the sequence is longer than the buffer cache, a second iteration through the sequence will always be reading pages after they've been evicted.
    This is a very familiar problem to me. It's essentially a problem of what to do when you have a small sample, and don't know how to extrapolate behavior from it. Until you've seen a loop iterate twice, you can't tell it's a loop, and if it only iterates two or three times, it's too late to learn much from it.

    One example of this is a simple garbage collector, which allocates all the available memory, then garbage collects most of it, then allocates whatever it reclaims all over again. If your GC does that with more memory than fits in RAM, you are screwed. LRU will tend to ensure you have page faults at least at the rate of memory allocation, because you reuse more memory in each cycle than fits in RAM, and no replacement policy can help a whole lot. (Even if you get the GC to reuse memory in a more stacklike order, you're still usually going to be paging unacceptably.)

    What you want is for your replacement policy to have a general idea of the large-scale locality properties of each process it deals with, knowing or learning whether a program (or memory-intensive phase of a program) tends to access a lot of memory in a loopy fashion, a stacky back-and-forth fashion, or apparently randomly (like for a program dominated by an enormous hash table without a huge hot/cold skew), or in a quasi-statically biased way that frequency-based replacement would work well for.

    (Frequency-based replacement usually doesn't work well at all, but there are a few things it does work well for, like programs whose locality is dominated by a huge hash table that has very hot and not-hot parts at the VM page granularity. Most don't, though, because if you have hot parts scattered pseudo-randomly through an array, it's likely most pages will contain at least one hot entry. From the point of view of virtual memory, a typical hash table is usually pretty hot all over if any of it is.)

    In hard cases, like programs that loop (or don't) through more than a memory-full of data, you want to keep information between runs of the program, so you notice things like whether the program usually just reads a bunch of data and ends before iterating over it, or typically goes back through the data twice or three times or more.

  11. #11
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by Paul W. View Post
    nburns:

    This is a very familiar problem to me. It's essentially a problem of what to do when you have a small sample, and don't know how to extrapolate behavior from it. Until you've seen a loop iterate twice, you can't tell it's a loop, and if it only iterates two or three times, it's too late to learn much from it.
    Well, you keep as much sample data as you want. For buffer cache eviction, it's not going to deal with anything that immediate. Incidentally, though, since you mention a loop: branch prediction in the CPU works at that scale, and it gets fairly impressive results.

    One example of this is a simple garbage collector, which allocates all the available memory, then garbage collects most of it, then allocates whatever it reclaims all over again. If your GC does that with more memory than fits in RAM, you are screwed. LRU will tend to ensure you have page faults at least at the rate of memory allocation, because you reuse more memory in each cycle than fits in RAM, and no replacement policy can help a whole lot. (Even if you get the GC to reuse memory in a more stacklike order, you're still usually going to be paging unacceptably.)

    What you want is for your replacement policy to have a general idea of the large-scale locality properties of each process it deals with, knowing or learning whether a program (or memory-intensive phase of a program) tends to access a lot of memory in a loopy fashion, a stacky back-and-forth fashion, or apparently randomly (like for a program dominated by an enormous hash table without a huge hot/cold skew), or in a quasi-statically biased way that frequency-based replacement would work well for.

    (Frequency-based replacement usually doesn't work well at all, but there are a few things it does work well for, like programs whose locality is dominated by a huge hash table that has very hot and not-hot parts at the VM page granularity. Most don't, though, because if you have hot parts scattered pseudo-randomly through an array, it's likely most pages will contain at least one hot entry. From the point of view of virtual memory, a typical hash table is usually pretty hot all over if any of it is.)

    In hard cases, like programs that loop (or don't) through more than a memory-full of data, you want to keep information between runs of the program, so you notice things like whether the program usually just reads a bunch of data and ends before iterating over it, or typically goes back through the data twice or three times or more.
    The original posting to HN that inspired this was about caching data from the filesystem. I'm aware that I've been ambiguous on this thread, and I'm not sure if Linux manages this cache together with VM pages or separately. But with the filesystem, data would outlast any single process, so you could keep track of data per disk block or per file. With a process's own memory, there would probably be a lot more guesswork in figuring out what's going on. Getting back to my proposal to compress traces: I think filesystem traces would make better candidates, because virtual memory traces would have too little information. On top of that, processes are aware of LRU and try to optimize their memory use for it. There's less that can be done about file accesses.
