Indeed, the evidence shows that layout matters greatly for the encoder, but not so much for the decoder. In fact, encoding/decoding from files rather than from my in-memory test (activated with "-t") shows different performance characteristics; possibly the in/out block memory alignment changes too. The test-mode decoder runs slower than when decoding from file(!) when using exactly a 1MB block size.

Anyway, perf is showing nice numbers to back up this theory, at least in part.

Input is my 8-binned quality file (so low entropy):

1013*1047 block size

Code:

@ deskpro107386[compress.../rans_sta...]; perf stat -e L1-dcache-load,L1-dcache-loads-misses,L1-dcache-stores,L1-dcache-store-misses ./a.out -o1 /tmp/in /tmp/out;perf stat -e L1-dcache-load,L1-dcache-loads-misses,L1-dcache-stores,L1-dcache-store-misses ./a.out -d /tmp/out /dev/null
Took 3191690 microseconds, 229.1 MB/s

 Performance counter stats for './a.out -o1 /tmp/in /tmp/out':

    10930222964 L1-dcache-load
       79789097 L1-dcache-loads-misses    #    0.73% of all L1-dcache hits
     4169473999 L1-dcache-stores
       23278924 L1-dcache-store-misses

    3.659998681 seconds time elapsed

Took 709925 microseconds, 1030.0 MB/s

 Performance counter stats for './a.out -d /tmp/out /dev/null':

     2619188081 L1-dcache-load
       68030637 L1-dcache-loads-misses    #    2.60% of all L1-dcache hits
      422798953 L1-dcache-stores
       22065522 L1-dcache-store-misses

    0.710987393 seconds time elapsed

1024*1024 block size

Code:

@ deskpro107386[compress.../rans_sta...]; perf stat -e L1-dcache-load,L1-dcache-loads-misses,L1-dcache-stores,L1-dcache-store-misses ./b.out -o1 /tmp/in /tmp/out;perf stat -e L1-dcache-load,L1-dcache-loads-misses,L1-dcache-stores,L1-dcache-store-misses ./b.out -d /tmp/out /dev/null
Took 3399114 microseconds, 215.1 MB/s

 Performance counter stats for './b.out -o1 /tmp/in /tmp/out':

    10916794653 L1-dcache-load
      804240191 L1-dcache-loads-misses    #    7.37% of all L1-dcache hits
     4161272365 L1-dcache-stores
       29233954 L1-dcache-store-misses

    3.861525937 seconds time elapsed

Took 725868 microseconds, 1007.4 MB/s

 Performance counter stats for './b.out -d /tmp/out /dev/null':

     2619148917 L1-dcache-load
       58870968 L1-dcache-loads-misses    #    2.25% of all L1-dcache hits
      422310647 L1-dcache-stores
       26928287 L1-dcache-store-misses

    0.727073387 seconds time elapsed

On something like enwik8 the entropy is much higher, so the number of order-1 symbol contexts is much larger, leading to more memory being touched overall. Even so, it's still a 4% vs 10% L1-dcache-load miss rate between the two versions, so the effect is significant. The decodes show slightly better cache performance at the 1024*1024 size, the opposite of the expected behaviour. However, as mentioned above, file-to-file encode/decode speed is much more even between the two versions anyway.

PS. For the same reason, my 4 or 8 histogram buffers for computing the frequencies are sized as (e.g.) F0[256+MAGIC], where MAGIC is a #define to 8 or similar. This had a huge impact on our older AMD servers with 2-way set-associative caches. I see FSE_count_parallel uses plain 256 for these; I should benchmark it with a slightly larger size on those servers to see.

Edit: yes, it helps. Alas, the kernel doesn't support L1-dcache-store-misses on our "AMD Opteron(tm) Processor 6378", but I can get other figures out:

Array size 256, enwik8 input, 100 trials.

Code:

    5905.845184 task-clock                #    1.000 CPUs utilized
    17291179661 cycles                    #    2.928 GHz                  [75.00%]
      238385676 stalled-cycles-frontend   #    1.38% frontend cycles idle [75.01%]
     5345616530 stalled-cycles-backend    #   30.92% backend cycles idle  [50.00%]

Array size 256+8

Code:

    5505.574011 task-clock                #    1.000 CPUs utilized
    16392730803 cycles                    #    2.977 GHz                  [75.01%]
      251208399 stalled-cycles-frontend   #    1.53% frontend cycles idle [75.01%]
     4349577595 stalled-cycles-backend    #   26.53% backend cycles idle  [50.02%]