
Thread: Strange 32 bit malloc() behavior

  1. #1
    Matt Mahoney (Expert)

    Strange 32 bit malloc() behavior

    I am investigating strange memory allocation behavior in 32 bit Windows. The program below allocates n arrays of size m (as a linked list) until out of memory, and then frees the list. m is passed as a parameter. Next it allocates all available memory again, but this time using a binary search to find the largest array it can allocate at each step and continuing until this size is 0.

    I am testing on 32 bit Vista with 3 GB of RAM, compiling with g++ and VC++. With either compiler, about 2 GB can be allocated in either pass regardless of m. I tried g++ -Wl,--large-address-aware with several builds (4.8.1 Mingw-x64, 4.8.1 GCC, 4.8.0 niXman), which is supposed to increase the address space to 3 GB, but the option has no effect.

    The strange behavior I am observing is that if the node size of the linked list is m = 520185 or larger, then the second pass is able to allocate a single contiguous array of a little less than 2 GB, followed by several smaller arrays. If m is 520184 or smaller, then malloc() is unable to allocate any array larger than 520184 bytes, although the total is still about 2 GB. The behavior is the same with g++ and VC++, so I am guessing it is a bug in the operating system or some other shared code: freeing all of the memory does not defragment the address space.

    Here is an example. mtg was compiled with g++ 4.8.1 -O3. mt was compiled with VC++ 16.00.30319.0 /O2.

    Code:
    C:\tmp>mtg 520185
    4071 x 520185 = 2117673135 allocated
    allocated 1968701408 at 009E0020
    allocated 140378080 at 77110020
    allocated 13500384 at 76300020
    allocated 8126432 at 7F7F0020
    allocated 2162656 at 76040020
    allocated 1834976 at 00240020
    allocated 1376224 at 00410020
    allocated 44032 at 000233E0
    allocated 8184 at 0002DFE8
    allocated 48 at 000217A0
    allocated 0 at 00000000
    4071 x 520185 = 2117673135 initially allocated and freed
    total reallocated 2136132424
    
    C:\tmp>mt 520185
    4074 x 520185 = 2119233690 allocated
    allocated 1965293536 at 00D20020
    allocated 140378080 at 77110020
    allocated 16383968 at 76040020
    allocated 8126432 at 7F7F0020
    allocated 5767136 at 00780020
    allocated 917472 at 00690020
    allocated 720864 at 00040020
    allocated 73624 at 001F0048
    allocated 65432 at 00020048
    allocated 57336 at 00201FE8
    allocated 42672 at 00773930
    allocated 8184 at 0077DFE8
    allocated 208 at 00771A80
    allocated 0 at 00771A80
    4074 x 520185 = 2119233690 initially allocated and freed
    total reallocated 2137834944
    
    C:\tmp>mtg 520184
    3979 x 520184 = 2069812136 allocated
    allocated 520184 at 7A8EF048
    allocated 520184 at 7B9BD048
    allocated 520184 at 7CA8B048
    allocated 520184 at 7DB59048
    allocated 520184 at 7EC27048
    allocated 520184 at 76973048
    allocated 520184 at 7FC6F048
    ... (4K lines not shown)
    
    C:\tmp>mt 520184
    3981 x 520184 = 2070852504 allocated
    allocated 520184 at 7A9ED048
    allocated 520184 at 7BABB048
    allocated 520184 at 7CB89048
    allocated 520184 at 7DC57048
    allocated 520184 at 7ED25048
    allocated 520184 at 75B9F048
    allocated 520184 at 7F40F048
    ... (4K lines not shown)
    When I sort the pointers, they are mostly at intervals of 520192 = 2^19 - 2^12, which is 8 bytes larger than the array size.
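
    Just for reference, the interval check can be done with a quick sketch like this (not the exact code I used for the numbers above):

    Code:
    // Sketch: allocate nodes of size m until malloc() fails, sort the returned
    // pointers, and print the gap between consecutive blocks.
    #include <algorithm>
    #include <stdio.h>
    #include <stdlib.h>
    #include <vector>
    
    int main(int argc, char** argv) {
      size_t m = argc>1 ? (size_t)atol(argv[1]) : 520184;
      std::vector<char*> v;
      v.reserve(1<<16);  // avoid reallocating the vector while the heap is nearly full
      void* p;
      while ((p=malloc(m))!=0) v.push_back((char*)p);
      std::sort(v.begin(), v.end());
      for (size_t i=1; i<v.size(); ++i)
        printf("%p gap %lu\n", (void*)v[i], (unsigned long)(v[i]-v[i-1]));
      for (size_t i=0; i<v.size(); ++i) free(v[i]);
      return 0;
    }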

    Of course this is different from 64 bit code. I tested a similar program in 64 bit Ubuntu with 4 GB real memory and 3 GB swap. It can allocate 140 TB in chunks of 7 GB as long as you don't write to it.

    Code:
    // mt.cpp - test allocating all memory. Public domain.
    
    #include <stdio.h>
    #include <stdlib.h>
    
    // Allocate the largest array possible. Print and return its size.
    size_t mtest() {
      size_t lo=1, hi=size_t(-1), mid=0, best=0;
      void* p=0;  // initialized so the final printf is defined even if best==0
      while (lo<hi) {
        mid=lo/2+hi/2+((lo&1)*(hi&1));  // overflow-safe midpoint
        p=malloc(mid);
        if (p) best=mid, free(p), lo=mid+1;
        else hi=mid-1;
      }
      // The largest free block may have shrunk since the search, so back off
      // until the allocation succeeds, and keep it allocated.
      while (best>0 && (p=malloc(best))==0) --best;
      printf("allocated %1.0f at %p\n", best+.0, p);
      return best;
    }
    
    int main(int argc, char** argv) {
      int m=0;
      if (argc>1) m=atoi(argv[1]);  // get m
      if (m<(int)sizeof(void*)) m=sizeof(void*);  // a node must at least hold a link pointer
    
      // Build a linked list of n nodes of size m bytes until out of memory.
      int n=0;
      void* start=0;
      void *p;
      while ((p=malloc(m))!=0) {
        ++n;
        *(void**)p=start;
        start=p;
      }
      printf("%d x %d = %1.0f allocated\n", n, m, n*1.0*m);
    
      // Free the list
      while (start) {
        p=*(void**)start;
        free(start);
        start=p;
      }
    
      // Allocate all available memory as arrays as large as possible
      size_t sum=0, t;
      while ((t=mtest())>0) sum+=t;
      printf("%d x %d = %1.0f initially allocated and freed\n", n, m, n*1.0*m);
      printf("total reallocated %1.0f\n", sum+.0);
      return 0;
    }

  2. #2
    just a worm (Member)

    Don't know if it helps:
    If the heap has a size limit, then every allocated block needs to be smaller than 524280 bytes. Such a size limit can be set when the heap is created with "HeapCreate".

    If the program uses the heap that is provided by Microsoft's portable-executable loader rather than one created manually, then there is no size limit. But who knows what those compilers do behind the programmer's back.

    Matt:
    "When I sort the pointers, they are mostly at intervals of 520192 = 2^19 - 2^12, which is 8 bytes larger than the array size."
    The official documentation says that "HeapAlloc" may return a bigger buffer than was requested, and if it does, the whole buffer can be used. But to be honest, I have never received a buffer that was even a single byte bigger than what I requested.

  3. #3
    Matt Mahoney (Expert)

    I was testing the linker heap options, but they only seem to reduce the available space without changing the fragmentation behavior. For example:

    g++ -O3 -Wl,--heap=500000000 mt.cpp
    or
    cl /O2 mt.cpp /link /heap:500000000

    reduces available memory from about 2.1 GB to 1.6 GB. The stack option has the same effect in either compiler.

  4. #4
    just a worm (Member)

    Allocating a big contiguous memory buffer isn't easy in Windows.
    - In Windows 2000 the library "ntdll.dll" is loaded at address 2,012,741,632 (0x77F80000).
    - In Windows XP the library "kernel32.dll" is loaded at address 2,088,763,392 (0x7C800000).

    As far as I know, setting the "large-address-aware" flag in the portable executable header only lets you use a little bit of the memory at addresses above 2 GB. You can't use all of it, because Windows still maps some data that is common to all applications into your linear address space.

    In any case the two libraries
    - "ntdll.dll" and
    - "kernel32.dll"
    will be in your linear address space no matter whether you have imported anything from them. And as far as I know it is really difficult to keep the loader from placing them at their preferred load addresses.

    I am not an expert on C++, but maybe it is possible to call the operating system's low-level functions directly. The function "HeapAlloc" is responsible for returning the requested memory; it takes as a parameter the heap to allocate from. You can use the function "GetProcessHeap" to get a handle to the unlimited heap that is created by the loader.
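
    A minimal sketch of what that might look like (untested, just to illustrate the calls):

    Code:
    // Untested sketch: allocate from the default process heap directly instead
    // of going through the compiler's malloc(). HeapSize() reports the usable
    // size, which may be larger than what was requested.
    #include <windows.h>
    #include <stdio.h>
    
    int main() {
      HANDLE heap=GetProcessHeap();     // the growable heap created by the loader
      SIZE_T request=600000;            // bigger than the ~512 KB fixed-heap limit
      void* p=HeapAlloc(heap, 0, request);
      if (p) {
        SIZE_T usable=HeapSize(heap, 0, p);
        printf("requested %u, usable %u at %p\n",
               (unsigned)request, (unsigned)usable, p);
        HeapFree(heap, 0, p);
      }
      else printf("HeapAlloc failed\n");
      return 0;
    }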

  5. #5
    Member

    Are you committed to using the Microsoft allocator?

    You might try Hoard, which is generally much better and cross-platform, with OSX, Windows, and Linux versions, 32- & 64-bit: http://www.hoard.org/

  6. #6
    Matt Mahoney (Expert)

    Thanks. I tried Hoard but it did not solve the problem. I am mainly interested in using it for the 32 bit Windows version of zpaq to make efficient use of the address space, since this is not a problem for 64 bit code. I tested with mt.cpp, but it can still only access 2 GB in 32 bit Windows. It also does not fix the fragmentation problem, but instead changes the threshold from 520184 to 327984 bytes. That is, if all memory is allocated in chunks larger than the threshold and then freed, it is possible later to allocate one large chunk close to 2 GB, then many smaller pieces. If all of the memory is allocated in chunks at or below the threshold and then freed, it is not possible later to allocate a chunk larger than the threshold.

    Hoard is licensed GPL v2, which would not be a problem for me because zpaq is GPL v3. One problem is that I can only compile and link with VC++, not g++, and the user would have to have libhoard.dll in their PATH to run the program.

    Also, mt.cpp runs slower with Hoard and does not initially allocate as much memory for the linked list. It leaves a lot of small fragments that do eventually get allocated in the second pass. Without Hoard:

    Code:
    C:\tmp>timer32 mt 1000000
    2034 x 1000000 = 2034000000 allocated
    allocated 1968766944 at 009D0020
    allocated 140378080 at 77110020
    allocated 16383968 at 76040020
    allocated 8126432 at 7F7F0020
    allocated 1966048 at 007C0020
    allocated 1114080 at 00230020
    allocated 983008 at 00040020
    allocated 57240 at 00020048
    allocated 42936 at 009A3828
    allocated 8184 at 0002DFE8
    allocated 8184 at 009ADFE8
    allocated 72 at 009A1A80
    allocated 0 at 009A1A80
    2034 x 1000000 = 2034000000 initially allocated and freed
    total reallocated 2137835176
    
    Commit   =   2091964 KB  =   2043 MB
    Work Set =     13236 KB  =     13 MB
    
    Kernel Time  =     0.031 =   37%
    User Time    =     0.000 =    0%
    Process Time =     0.031 =   37%
    Global Time  =     0.083 =  100%
    With Hoard (compiled: cl /O2 /MD mt.cpp uselibhoard.cpp libhoard.dll). Note that the first two attempts to allocate a very large array fail, then the third succeeds, although it is not as large.

    Code:
    1806 x 1000000 = 1806000000 allocated
    allocated 1175280 at 7FE50040
    allocated 2097150 at 7F490040
    allocated 1722875840 at 00A70040
    allocated 197132224 at 67640040
    allocated 137887680 at 77110040
    allocated 47054784 at 73280040
    allocated 8732624 at 76780040
    allocated 7277184 at 76080040
    allocated 5053600 at 7F970040
    allocated 1692416 at 00780040
    allocated 1175280 at 00940040
    allocated 816160 at 001E0040
    allocated 680128 at 7F8C0040
    allocated 566768 at 7F810040
    allocated 393584 at 00040040
    allocated 273312 at 00730040
    allocated 189792 at 001B0040
    allocated 131792 at 76050040
    allocated 76256 at 7F7F0040
    allocated 63536 at 00930040
    allocated 63536 at 00020040
    allocated 52944 at 76040040
    allocated 36752 at 7F8B0040
    allocated 4096 at 7F8A2040
    allocated 4096 at 7F8A4040
    allocated 4096 at 7F8A6040
    allocated 4096 at 7F8A8040
    allocated 4096 at 7F8AA040
    allocated 4096 at 7F8AC040
    allocated 4096 at 7F8AE040
    allocated 4096 at 7F8AD040
    allocated 4096 at 7F8AB040
    allocated 4096 at 7F8A9040
    allocated 4096 at 7F8A7040
    allocated 4096 at 7F8A5040
    allocated 4096 at 7F8A3040
    allocated 4096 at 7F8A1040
    allocated 64 at 7FF90080
    allocated 64 at 7FF900C0
    allocated 64 at 7FF90100
    allocated 64 at 7FF90140
    allocated 64 at 7FF90180
    allocated 64 at 7FF901C0
    allocated 64 at 7FF90200
    ... (about 3000 lines deleted)
    allocated 32 at 7FF8FEE0
    allocated 32 at 7FF8FF00
    allocated 32 at 7FF8FF20
    allocated 32 at 7FF8FF40
    allocated 32 at 7FF8FF60
    allocated 32 at 7FF8FF80
    allocated 32 at 7FF8FFA0
    allocated 32 at 7FF8FFC0
    allocated 32 at 7FF8FFE0
    allocated 0 at 7FF8FFE0
    1806 x 1000000 = 1806000000 initially allocated and freed
    total reallocated 2135686814
    
    Commit   =   2090620 KB  =   2042 MB
    Work Set =     12848 KB  =     13 MB
    
    Kernel Time  =     0.202 =   79%
    User Time    =     0.046 =   18%
    Process Time =     0.249 =   97%
    Global Time  =     0.256 =  100%
    Hoard doesn't have a 32 bit binary distribution so I built it from source using nmake and VC++ as described in their readme file.

  7. #7
    Member

    If you go to the github Hoard/releases page, it looks like there are binaries for all platforms, including win32:

    https://github.com/emeryberger/Hoard/releases

    I'm not sure it's actually what it looks like, though.

    It also doesn't look like that will solve your main problem. Sorry about that.

    (BTW, performance for artificial benchmarks like that is often not indicative of actual performance; good allocators usually cache already-chopped-up-and-freed chunks of memory and can reallocate them from "quick lists" blazingly quickly. If all your program does is allocate all its memory and free it all, that defeats the heuristic.)

  8. #8
    Member

    BTW, do you know if this happens under 32-bit Linux with the default glibc allocator? (Which at least until fairly recently was based on Doug Lea's malloc.)

    You might want to ask Emery Berger (the Hoard honcho) about this over at Stack Overflow. He's worked at Microsoft and they use some of his stuff in recent systems, so he may know what's up with whatever malloc they give you.

  9. #9
    Matt Mahoney (Expert)

    Thanks. I see the Win32 download now. The Hoard webpage only had Win64 but both 32 and 64 bit Linux.

    I also tried it as a 32 bit program in 64 bit Ubuntu (g++ -O3 -m32 mt.cpp). Memory allocation seems better behaved: it is able to allocate 4 GB, and there is no fragmentation as far as I could tell with various chunk sizes. (I had to get the 32 bit libraries using: sudo apt-get install g++-multilib.) I didn't try Hoard.

    When compiled for 64 bit Linux it will allocate 140 TB in large chunks or cause disk thrashing with smaller chunks.

    Anyway, what I am doing is trying to use memory more efficiently in zpaq. Currently when it compresses in multiple threads, it saves input and output blocks in memory in contiguous blocks. But allocating large blocks is more likely to fail, and there is no reason for the output to be contiguous while it is waiting to be appended to the archive. So what I will do in the next version is save the output in smaller chunks, a little larger than the threshold for fragmentation.
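
    Something along these lines (a sketch of the idea, not the actual zpaq code):

    Code:
    // Sketch: hold the compressed output in fixed-size chunks slightly above the
    // fragmentation threshold instead of one big contiguous block, since it is
    // only read back sequentially when it is appended to the archive.
    #include <vector>
    
    class ChunkedBuffer {
      enum {CHUNK=1<<20};                     // 1 MB, above the ~520 KB threshold
      std::vector<std::vector<char> > chunks;
      size_t len;
    public:
      ChunkedBuffer(): len(0) {}
      void put(char c) {
        if (len%CHUNK==0) {                   // start a new chunk
          chunks.push_back(std::vector<char>());
          chunks.back().reserve(CHUNK);
        }
        chunks.back().push_back(c);
        ++len;
      }
      size_t size() const {return len;}
      char get(size_t i) const {return chunks[i/CHUNK][i%CHUNK];}
    };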

  10. #10
    Bulat Ziganshin (Programmer)

    Matt, 7-zip solves the problem by allocating all buffers once and then reusing them throughout the operation.
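
    Roughly like this (just to illustrate the idea, not 7-zip's actual code):

    Code:
    // Sketch of allocate-once-and-reuse: each worker gets its input and output
    // buffers at startup and keeps them for the whole run, so the heap never
    // gets a chance to fragment.
    #include <vector>
    
    struct Worker {
      std::vector<char> in, out;
      Worker(size_t bufsize): in(bufsize), out(bufsize) {}
      void processBlock() {
        // read into in, compress into out, write out; the same buffers are
        // reused for every block instead of being freed and reallocated
      }
    };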

  11. #11
    Member

    Matt:

    "When compiled for 64 bit Linux it will allocate 140 TB in large chunks or cause disk thrashing with smaller chunks."

    I'm not sure what you mean here. (And did you mean terabytes?)

    How much memory do you ever actually have live (allocated and not yet freed) at a time? How much virtual memory space does it use?

    Can you tell me a bit more about your threads and how they allocate memory?

    Do you have a fixed number of threads, comparable to the number of cores, that repeatedly read a buffer and write a buffer, or do you spawn a new thread for each buffer-full of compression work, with each thread allocating a single-use input and output buffer and freeing it when it finishes, or what?

  12. #12
    Matt Mahoney (Expert)

    Yes, I mean 140 terabytes. The computer I am testing on has 4 GB of RAM and 3 GB of swap. Thus, the largest piece it can allocate is 7 GB. But if the memory is not accessed then it is never mapped to physical RAM or swap, so only address space is reserved. The test program only writes to one page of each 7 GB array to save the pointer to the next array in the linked list, so with about 20K nodes only about 80 MB (20K x 4 KB pages) is actually used. The actual limitation, I guess, is the 48 bit address space. If I run the program with smaller node sizes, then it fills up the swap and it can run for hours, but with a larger node size like 1 GB it runs quickly.

    zpaq uses memory for the following. First it reads the index into a data structure with lots of small strings and vectors, so there is lots of fragmentation but not much memory used unless the index is huge. Each file allocates a string for the file name and a small vector for its list of versions, each of which has a small vector of fragment IDs. The largest single vector is the list of fragment hashes and sizes. It uses 32 bytes per fragment (average fragment size 64 KB), so about 0.5 MB for each GB of uncompressed data. g++ and MSVC behave differently when a vector grows beyond its allocated capacity: g++ allocates a new block twice as large, while MSVC grows it in smaller steps.
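
    A quick way to see the difference (a small test, not zpaq code):

    Code:
    // Print the capacity of a std::vector every time it changes while elements
    // are appended. libstdc++ (g++) roughly doubles; MSVC grows in smaller steps.
    #include <stdio.h>
    #include <vector>
    
    int main() {
      std::vector<int> v;
      size_t cap=0;
      for (int i=0; i<1000000; ++i) {
        v.push_back(i);
        if (v.capacity()!=cap) {
          cap=v.capacity();
          printf("size %u -> capacity %u\n", (unsigned)v.size(), (unsigned)cap);
        }
      }
      return 0;
    }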

    For compression, there is an additional hash table mapping SHA-1 hashes to fragment IDs to quickly find matching fragments for deduplication. It is a vector with a fixed 4M entries, each of which is a small but growing vector, so again lots of small allocations.
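
    The shape of it is roughly this (simplified sketch, not the actual zpaq code):

    Code:
    // Simplified sketch of the dedup index: a fixed table of 4M buckets indexed
    // by the low 22 bits of the SHA-1 hash; each bucket is a small, growing list
    // of fragment IDs whose full hashes are then compared.
    #include <stdint.h>
    #include <vector>
    
    struct FragIndex {
      std::vector<std::vector<uint32_t> > table;   // 4M buckets of fragment IDs
      FragIndex(): table(1<<22) {}
    
      // Return the bucket for a 20 byte SHA-1 hash.
      std::vector<uint32_t>& bucket(const unsigned char* sha1) {
        uint32_t h = sha1[0] | (sha1[1]<<8) | ((uint32_t)(sha1[2]&0x3f)<<16);
        return table[h];
      }
    };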

    Most of the memory is used for compression and decompression. First there is the memory used by the model and pre/post processor. This varies depending on the type of data in the block, so it is not possible to allocate it all in advance, and in most cases it requires large contiguous blocks of memory. For example, compressing a 64 MB block using BWT requires a single 288 MB block for the suffix array and partial inverse suffix array. Decompressing it requires a 64 MB block to hold the output and a 256 MB block for the IBWT, plus a smaller amount for modeling. LZ77 compression requires large arrays to hold the hash table or suffix array for finding matches. The CM methods require large arrays to hold tables mapping context hashes to bit histories, match model buffers and hash tables, and other data structures.

    zpaq also uses buffers to hold the input and output for multi-threading. The input buffers have to be contiguous because they are accessed randomly for preprocessing; for example, a block may get an E8E9 transform followed by BWT, or may be used to confirm LZ77 matches. The input buffer may be modified before compression. The buffer size is user selectable but typically 16 to 64 MB. The output buffer does not have to be contiguous and may be variable sized depending on the compression ratio. In zpaq 6.49 it is contiguous, but I am changing that. The output has to be held in memory so that it is written to the archive in the correct sequence. In zpaq 6.49 there are two jobs per thread, but only one per thread may compress at the same time. The other holds input, waits for an available core, compresses, then holds the output and waits for the jobs ahead of it to append to the archive. In 6.50 a job will free its input buffer when it is done compressing and is waiting to write to the archive, instead of after it is done writing.
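
    In outline, the 6.50 buffer lifecycle looks like this (a sketch, not the actual code):

    Code:
    // Sketch of the job lifecycle described above: the input block stays
    // contiguous for the transforms, but is released as soon as compression
    // finishes; the output is kept until it is this job's turn to append to
    // the archive.
    #include <vector>
    
    struct Job {
      std::vector<char> in;                   // contiguous input block
      std::vector<char> out;                  // compressed output (need not be contiguous)
    
      void run() {
        // ...wait for an available core...
        compress();                           // reads in (randomly), fills out
        std::vector<char>().swap(in);         // free the input buffer now, not later
        // ...wait for the jobs ahead, then append out to the archive...
        std::vector<char>().swap(out);
      }
      void compress() {out.assign(in.begin(), in.end());}  // placeholder
    };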

    Decompression uses output buffers but not input buffers. Jobs read from the archive up to the number of fragments requested from each block and hold the output in order to serialize writing to the disk one file at a time. The idea is to minimize disk head movement and file fragmentation. It is also necessary because a decompressed fragment may occur in more than one copy of a file, so it may be written multiple times. These buffers also don't need to be contiguous but it is a little more complicated to fix this so it probably won't be in the next release. There is an optimization of not writing large blocks of all zeros, producing sparse files.
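
    The zero-block optimization looks roughly like this (a sketch, not the actual code):

    Code:
    // Sketch: seek past a block of zeros instead of writing it. The skipped range
    // reads back as zeros and, on file systems that support it, is stored
    // sparsely. If the file ends in a hole, the last byte still has to be written
    // (or the file truncated to the right length) so the size comes out correct.
    #include <stdio.h>
    
    bool write_block(FILE* f, const char* buf, size_t n) {
      bool allzero=true;
      for (size_t i=0; i<n && allzero; ++i) allzero=(buf[i]==0);
      if (allzero) return fseek(f, (long)n, SEEK_CUR)==0;  // skip, leave a hole
      return fwrite(buf, 1, n, f)==n;
    }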

