Thursday, January 18, 2024

Testing RAM Performance with Mismatched Modules

I upgraded the RAM in one of my computers and ended up with some extra 8 GB DDR4 sticks. My computers are mostly mini-ITX systems with only two RAM slots, so I normally buy matched pairs of RAM sticks, but for various reasons my main workstation had only a single slot filled with a 16 GB module. So I thought I'd see whether performance increased if I added an 8 GB module for 24 GB of RAM total.

A little research suggested that "dual-channel" mode would be a bit faster with matched sizes, but there's not widespread agreement on whether this would also work with mismatched sizes. For example, Socket 939 processors supported a "ganged" mode (see page 16) that combined two 64-bit channels into a 128-bit channel if DIMM sizes were matched across the two channels. Some people online spoke of a "flex mode" for partial dual-channel operation with mismatched modules, but others claimed this was an Intel-only technology.

It looks like AMD's Socket 939 had "A" and "B" pins for memory control signals, but the above document says these carry identical signals, with an "A" and a "B" pin provided for each to reduce electrical loading when four DIMMs are connected. With Family 10h, AMD changed this to two independent memory channels, but still offered "ganged" mode as a BIOS option for years.

So the question is, if the two DIMMs are accessed using independent memory controller channels, are the addresses still interleaved between DIMMs? If yes, is it still true if sizes are mismatched? I couldn't find any documentation of the Zen3 RAM controller that might help answer the question for current CPUs, so I just had to test it and see for myself.

The conclusion? Yes, mixing different-sized modules on different channels does increase RAM bandwidth for sequential writes, but not as much as using matched modules.

Test Details

I didn't know the standard way to test memory speeds on Linux, so I used a suggestion from here: https://serverfault.com/questions/372020/what-are-the-best-possible-ways-to-benchmark-ram-no-ecc-under-linux-arm

The suggestion amounted to running these commands:

$ sudo mkdir -p /mnt/test1
$ sudo mount -t tmpfs tmpfs /mnt/test1
$ sudo dd if=/dev/zero of=/mnt/test1/test bs=1M

This creates a tmpfs RAM disk (tmpfs defaults to half the size of physical RAM) and fills it with zeros until it runs out of space, at which point dd reports the average write rate. I tried it with three configurations:

  • A single 8-GB DDR4-3200 module: 5.7 GB/s
  • An 8-GB DDR4-3200 module plus a 16-GB DDR4-3200 module: 5.9 GB/s (+3.5%)
  • Two 8-GB DDR4-3200 modules: 6.2 GB/s (+8.7%)

What's going on here? Well, naively, you'd expect that filling RAM with zeroes would get close to the maximum transfer rate of 3200e6 transfers/s * 8 bytes = 25.6 GB/s. But it's a little more complicated, since we're actually reading from /dev/zero, then copying the resulting buffer. The buffer is small enough to fit into my Zen3 CPU's L3 cache, so most of this work is not limited by RAM speed. In particular, the RAM controller doesn't have to switch between reading and writing, which could slow things down a lot. On the other hand, for each byte, we're writing twice and reading once. On top of that, we have system call overhead and filesystem overhead from using tmpfs. With a single channel, this crude benchmark reaches about 22% of the maximum speed.

According to the unofficial AM4 pinout, each RAM channel has dedicated data, address, and control signals, so the two channels should be able to operate independently. This would potentially double the maximum bandwidth to 51.2 GB/s if writes were interleaved between the two channels. My understanding is that Zen3's CPU cores can read or write 32 bytes (256 bits) to L3 on each ~4000-MHz clock cycle, so a core should have no problem saturating two 64-bit RAM channels at 3200 MT/s.

If we interpret the single-channel result as saying that roughly a quarter of the time was spent writing to RAM at the maximum transfer rate, then doubling the RAM bandwidth should give about a 14% speedup (1/(1 - 0.25/2) = 1.14). Instead, we get an 8.7% speedup.

The 24-GB case with mixed sizes is more interesting. Here, we get a 3.5% speedup. So presumably the memory controller is able to use both channels some of the time.

What's not clear is how the memory interleaving works. Is the memory controller able to interleave addresses with 8-byte granularity in both dual-channel cases? The 3.5% speed improvement versus 8.7% for matched sizes suggests that less than half of accesses are interleaved, even though we'd expect 2/3 of addresses to be interleavable.

I'd like to redo the testing with a much simpler rep movsb loop that has very little overhead. Unfortunately, after doing all this testing in my cramped mini-ITX case, the 16 GB RAM stick quit working, so I can't repeat it with a better benchmark!
