Strange heap-alignment sensitivity

2»

Comments

  • So what do we think is going on 'under the hood' to result in these huge differences (from a speed penalty when straddling cache lines of 'only' about 80-times to more than 1600-times in some of my tests)?

    Is it invalidating - and then reloading - the entire on-chip data cache or something, and therefore dependent on main memory bandwidth? And why would it do something so crazy when there is explicitly no bus-locking guarantee when the access is not 4-byte (32-bit) aligned?

    I'm going to continue to consider this behaviour a 'fault', albeit seemingly a longstanding one, unless and until I find a rational explanation.
  • It is very striking. I confess I'd initially misread your post and thought the desktop was at fault, but I see now it's the laptop.

    So one idea might be that the laptop has a fast CPU with a huge cache but a terribly slow external memory. You could perhaps test that by suitable pseudorandom accesses into a large array.

    But the idea that some cache is being invalidated is an interesting one. Perhaps even that the cachelines holding the code are being evicted and having be reloaded?

    Also it's interesting that your initial report has 4 out of 64 alignments problematic, but subsequent tests are showing 3 out of 64. Is that possibly because in the initial program there were two critical accesses, and they were themselves misaligned by one byte, so that two lots of 3 overlap to make one bunch of 4 alignments go wrong?

    Also, just possibly it's a typo, but your original tabulation shows 4, then 5, and then 4 alignments are problematic. If not a typo, I wonder if that's a big clue.
  • Oh, two more things I see on further inspection
    - one of my experiments shows numbers around 600, which is about half as bad as your 1300 sized numbers
    - you mention 13 microseconds, but I wonder, is 1300 actually 1.3 microseconds? So the problem is not quite as bad although still very surprising
  • BigEd wrote: »
    But the idea that some cache is being invalidated is an interesting one. Perhaps even that the cachelines holding the code are being evicted and having be reloaded?
    It looks like I have been misled by Microsoft's documentation. According to what it says here Intel states that the exchange instruction is still atomic even when the memory address is misaligned, contrary to what Microsoft says.

    That makes the behaviour somewhat more explicable, because achieving atomicity in the case of crossing a cache-line boundary must be difficult. Why it's so slow and so variable between systems I still don't understand.
    Also it's interesting that your initial report has 4 out of 64 alignments problematic, but subsequent tests are showing 3 out of 64. Is that possibly because in the initial program there were two critical accesses, and they were themselves misaligned by one byte
    Indeed, I believe that to be exactly the reason; two accesses which happen to be one-byte different in their alignment.
    you mention 13 microseconds, but I wonder, is 1300 actually 1.3 microseconds? So the problem is not quite as bad although still very surprising
    1300 centiseconds is 13 seconds, and the test program executes 1,000,000 (one million) instructions in that time. According to my arithmetic, 13 seconds divided by one million means each instruction takes 13 μs. I'm entirely happy to be proved wrong if you can show me the flaw in the argument.
  • Ah, of course, centiseconds!
  • This is perhaps a longshot, but if you trust the memory-seller Crucial, they have a tool on their site which will report exactly the RAM in your machine. As your numbers are only a factor of two different from some of mine - and you have a more modern CPU which might have pushed things further than on the CPUs I've measured - it does look to me like a plausible trend, that cache-line-spanning locked accesses have become more costly, presumably as a tradeoff against something else. It's likely, I think, that compilers won't use xchg with memory, so only hand-crafted code would suffer, so the tradeoff would look good to Intel.

    It would also be interesting if someone could test on an AMD CPU - they make different tradeoffs.
  • BigEd wrote: »
    Ah, of course, centiseconds!
    It's BBC BASIC's TIME pseudo-variable that the program is using, so I don't see how you could have been in any doubt about the units.
    I think, that compilers won't use xchg with memory
    What makes you think that? If you ask the compiler to build thread-safe binaries, I would expect it to be highly likely that it will use xchg with memory, or the related cmpxchg instruction, (with aligned addresses, of course) because it's a lightweight and fast way to achieve an atomic operation.

    Remember that with compiled code you (i.e. the C programmer) must explicitly state that an object is not, or might not be, aligned because otherwise the compiler will assume that it is. In the case of such an unaligned object then, yes, it will certainly avoid the use of xchg with memory, but not otherwise because it would be foregoing a valuable performance win.

    I could never distribute 'compiled' BBC BASIC for a 32-bit x86 platform (e.g. Windows) because even the best compiler is useless at optimising the code compared with my hand-crafted assembly language. There's something like a two-fold difference in performance, and a comparable difference in code size (which also impacts performance because of instruction cache efficiency).
  • To try to line up these results with the Intel microarch evolution:
    • Ivy Bridge mobile, introduced 2011, 85 centisecond penalty, i5-3210M, 2 cores, 3MB of L3 cache
    • Haswell, introduced 2014, 500-600 centisecond penalty, 10 cores, 25MB of L3 cache
    • Skylake, introduced 2017, 450 centisecond penalty, probably 12 cores, 16MB of L3 cache
    • Alder Lake, introduced 2021, 1300 centisecond penalty, maybe in this case (i7 1260) 2 fast + 8 efficient cores, 12 MB of L3 cache
  • BigEd wrote: »
    To try to line up these results with the Intel microarch evolution:
    How (if at all) are you normalising these results for clock speed? Unless you know the relative clock speeds for each of those systems the comparison is meaningless, and in the case of laptops there is the added complication that the clock speed is typically adjusted dynamically, and may well be lower when running on the battery then when connected to a mains adaptor.

    What you could try to do is increase the loop count to a figure much greater than 1,000,000 so that the 'fast' loops take a significant amount of time (say at least a second), and then measure the ratio between the slow loops and the fast loops. The snag is that the slow loops could then take as long as 15 minutes or so, and collating the measurements might take a long time.

  • BigEd
    edited March 2024
    Indeed, it's not at all clear what to scale against. But as the results go in the direction they do, it doesn't seem too big a problem to me. If we were to scale, as we're not looking at a few CPU clock cycles, I would think it would against the L3 access time or the RAM access time, neither of which we know, but both of which are not going to relate to CPU clock speed much. I think it's quite possible L3 access gets slower, for example.

    Have you run this xchg measurement on your PC? I'm wondering if there is anything to see there - obviously we expect it to be much less of an effect than the laptop, because that's what you first noticed. And, of course, what CPU is in the PC? Something is evidently different.
  • [Richard Russell]
    edited March 2024
    BigEd wrote: »
    Have you run this xchg measurement on your PC?
    It was around 75 cs in absolute terms on my main desktop PC, but that's also my fastest machine and I have no way of knowing what contribution the various components - CPU type, CPU clock, main-memory speed etc. - make to that figure.

    What I do know is that although measuring just the xchg eax,[rbx] instruction results in a large alignment-dependent difference in execution speed even on the desktop, the impact on a BASIC program (even the tightest FOR...NEXT loop) isn't enough to be noticeable subjectively on that PC.

    Similarly I've measured, as best I can, the impact of the effect on the overall result from the timing.bbc program I usually use as a benchmark, and it's negligible; I can't isolate it from other sources of variation from run to run.

    It's only on the laptop, with about a 20-fold worse impact of adverse alignment, that things start to be noticeable in a practical BASIC program.
  • Is the PC also Alder Lake? It's not impossible that Intel have bad microcode for this case and it hasn't yet been spotted and patched.
  • Doing a little reading, I see that random access latencies of modern DIMMs are in the 90-100ns range. So a mere 128 random uncached accesses would add up to 1300 centiseconds (if my arithmetic is right, which it might not be.)

    I also see that an Alder Lake CPU can run its memory controller at full speed or half speed, depending on the needs of the DIMM. In fact, if I understand correctly, a faster DIMM speed may demand the controller run at half speed. So the speed of the DIMMs vs the speed of the CPU would be relevant.

    I see that Intel offer a memory latency measuring tool, which could be interesting (if you think it worthwhile of course)

    Less official, I see there are tools out there which can report the clock speeds and latencies of various components in the memory system
    CPU-Z for Windows® x86/x64 is a freeware that gathers information on some of the main devices of your system :

    And it may be that the BIOS settings can be used to change some of the behaviour of your system - but it would be completely understandable if you didn't want to change anything there.
  • BigEd wrote: »
    Is the PC also Alder Lake?
    No, it's Kaby Lake.
  • Thanks - so that's a 2017 introduction, a little earlier than Alder Lake.

    One surviving hypothesis is that Alder Lake is the cause of the trouble, another is that there's something about the laptop itself or its setup.

    (Either way, of course, if I understand correctly, for BBC Basic there's an easy workaround for the user of changing variable names so the list isn't ping-ponging, and that's in the rare case that someone even finds a performance problem.)
  • BigEd wrote: »
    but it would be completely understandable if you didn't want to change anything there.
    What you may have momentarily lost sight of (Michael does the same with Matrix Brandy) is that BBC BASIC for Windows and BBC BASIC for SDL 2.0, which are equally affected by this issue, are products - albeit in the latter case free - which I make available for others to download and use.

    So how well or badly they work on my own PC(s) is completely unimportant to me, and making some change which may or may not improve performance on my laptop, but doesn't help other users, is of no interest to me at all. After all I'm not likely to be recommending to my users that they start changing BIOS settings. ;)
  • Oh, indeed so - just that there might be something to learn.
  • BigEd wrote: »
    One surviving hypothesis is that Alder Lake is the cause of the trouble, another is that there's something about the laptop itself or its setup.
    I'm not sure if you've posted any AMD results, but here's what an old Athlon gives:
       0 : 12              1 : 14              2 : 1               3 : 1
       4 : 1               5 : 1               6 : 1               7 : 8
       8 : 10              9 : 10             10 : 1              11 : 0
      12 : 1              13 : 1              14 : 1              15 : 10
      16 : 12             17 : 12             18 : 1              19 : 0
      20 : 1              21 : 6              22 : 0              23 : 13
      24 : 12             25 : 13             26 : 1              27 : 1
      28 : 3              29 : 1              30 : 1              31 : 8
      32 : 13             33 : 11             34 : 1              35 : 1
      36 : 1              37 : 1              38 : 1              39 : 13
      40 : 12             41 : 12             42 : 1              43 : 1
      44 : 1              45 : 5              46 : 1              47 : 13
      48 : 12             49 : 11             50 : 1              51 : 1
      52 : 3              53 : 1              54 : 1              55 : 11
      56 : 11             57 : 12             58 : 1              59 : 4
      60 : 1              61 : 1              62 : 1              63 : 12
    
    It seems to be penalising three addresses out of every eight rather than 3 out of 64 (but not by much, given it's a slow CPU anyway).
  • I have a result from Zen4 (a recent AMD chip) AMD Ryzen 7, 7840U, 16 core, 16MB L3
       0 : 1               1 : 0               2 : 0               3 : 1            
       4 : 0               5 : 0               6 : 1               7 : 0            
       8 : 0               9 : 1              10 : 0              11 : 0              
      12 : 1              13 : 0              14 : 0              15 : 1           
      16 : 0              17 : 0              18 : 1              19 : 0            
      20 : 1              21 : 0              22 : 0              23 : 1            
      24 : 0              25 : 0              26 : 1              27 : 0            
      28 : 142            29 : 133            30 : 76             31 : 1            
      32 : 0              33 : 0              34 : 0              35 : 0            
      36 : 0              37 : 1              38 : 0              39 : 0            
      40 : 0              41 : 0              42 : 1              43 : 0            
      44 : 0              45 : 0              46 : 0              47 : 0            
      48 : 1              49 : 0              50 : 0              51 : 0            
      52 : 0              53 : 1              54 : 0              55 : 0            
      56 : 0              57 : 0              58 : 1              59 : 0            
      60 : 0              61 : 0              62 : 0              63 : 1            
    
  • BigEd
    edited March 2024
    I have a result from an even newer Intel CPU, i7-1365U which is Raptor Lake, 2023, 12 cores, 12MB of L3:
       0 : 2               1 : 1               2 : 0               3 : 0            
       4 : 2               5 : 0               6 : 1               7 : 0            
       8 : 0               9 : 1              10 : 1              11 : 0            
      12 : 0              13 : 1              14 : 0              15 : 0            
      16 : 0              17 : 1              18 : 0              19 : 0            
      20 : 1              21 : 0              22 : 1              23 : 0            
      24 : 1              25 : 0              26 : 0              27 : 1            
      28 : 1068           29 : 1018           30 : 1054           31 : 1            
      32 : 0              33 : 0              34 : 1              35 : 0            
      36 : 0              37 : 1              38 : 0              39 : 0            
      40 : 1              41 : 1              42 : 0              43 : 0            
      44 : 1              45 : 0              46 : 0              47 : 1            
      48 : 1              49 : 0              50 : 1              51 : 0            
      52 : 1              53 : 0              54 : 1              55 : 0            
      56 : 0              57 : 1              58 : 0              59 : 0            
      60 : 0              61 : 1              62 : 0              63 : 0            
    

    In this case the penalty is a little less severe than the worst we've seen, but still rather a lot.

    I think this shows that your laptop, Richard, is just doing what that particular CPU is going to do, for whatever internal implementation reason.

    I do note from readings that writes spanning a 4k boundary have an even bigger penalty than those spanning cache lines!
  • [Richard Russell]
    edited March 2024
    BigEd wrote: »
    I have a result from an even newer Intel CPU, i7-1365U which is Raptor Lake, 2023, 12 cores, 12MB of L3:
    What clock speed? If it's running in a desktop, and therefore with a faster clock than my laptop, I suspect you'll find those figures equate to a monstrous penalty when expressed in clock cycles (could be around 40,000 for just one instruction)!

    But as you say this is something which we are going to have to live with. I don't think there's anything to be gained by more analysis, and statistically it's unlikely to have much impact on a BASIC program.

    There are dozens of exchange-with-memory instructions in the assembly-language code of my interpreters, but the vast majority of them are xchg esp,[savesp] which is a fixed, and aligned, memory address so there won't be any penalty.

    Those instructions are key to the performance of my interpreters because they are what make it possible to use the same stack for BASIC's looping and nesting structures and for the CPU's call/return and push/pop stack.

    This single-stack architecture is elegant and relieves pressure on the very limited register resources of the 32-bit x86, but it does mean that when calling any 'external' routine (e.g. in the OS or a third-party DLL) you have to switch back to the conventional stack:
    xchg esp,[savesp]
    call external_routine
    xchg esp,[savesp]
    

    Incidentally it's something that is impossible to achieve when coding in C, meaning that it's one of the penalties of coding in a high-level language which can never be overcome by improving how 'clever' the compiler is. Having to use a separate register for BASIC's stack pointer (and one which can't benefit from shortcut instructions like push and pop) is a major overhead.

    If I was expecting ever to write x86 assembly language code again, which I'm not, it would be wise to avoid using an exchange-with-memory instruction when the alignment isn't predictable.

    I'm planning to leave this forum, no doubt much to your relief. You might like to tell the StarDot members that I'm finally giving up; they will be delighted.
  • Hi Richard, no, I feel no relief at the idea of you leaving, quite the reverse. I think we've had many interesting and informative posts and threads here and I would prefer that we continue to do so.

    It's good that other uses of xchg in your interpreters are aligned - this peculiarity of a high cost to unaligned usages will not be relevant. The information we have so far indicates to me that the problem is due to, and restricted to, late-model Intel CPUs. I've found it an interesting investigation, even if the net effect is not to take any particular action.

    (And yes, the CPU last measured is clocked very fast indeed, up to 5.2GHz, but as it happens is, surprisingly, also in a laptop.)