Friday, January 19, 2024

More Testing of Mismatched RAM Modules

Here's a simple program to test whether dual-channel mode is being used for RAM.

Using this, I found that read speeds on my system increase even when I mismatch DIMM capacities in the A and B memory channels, but write speeds do not.

The code

// test.c
// compile with gcc -o test -O0 test.c
// then run with ./test
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/time.h>

#define GB 8.0
#define SZ ((size_t) (GB*1073741824UL))

static double getTime(void)
{
  struct timeval t;
  gettimeofday(&t, NULL);
  return t.tv_sec + (double) t.tv_usec/1e6;
}

int main(int argc, char *argv[])
{
  uint8_t *p;
  double t1, t2;

  p = malloc(SZ);
  if (p == NULL) {
    fprintf(stderr, "malloc of %zu bytes failed\n", SZ);
    return 1;
  }
  memset(p, 'a', SZ); // must write non-zero data first so the OS will actually map the pages to RAM

  // test write
  t1 = getTime();
  memset(p, 0, SZ);
  t2 = getTime();
  printf("Wrote %.1f GB in %f seconds -> %3.2f GB/s\n", GB, t2-t1, ((double)GB)/(t2-t1));

  // test read
  t1 = getTime();
  void *n = memchr(p, 'a', SZ);
  t2 = getTime();
  printf("Read %.1f GB in %f seconds -> %3.2f GB/s\n", GB, t2-t1, ((double)GB)/(t2-t1));
  printf("%p\n", n); // must use n for something to avoid memchr() being optimized out

  return 0;
}

The memset() and memchr() standard library functions are presumed to be heavily optimized for writing and reading from RAM, respectively. And in fact, memset() runs just a little slower than the theoretical limit, while memchr() is faster than rep lodsq, so I assume whoever wrote it is better at optimizing than I am.
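To see how much that library optimization matters, here's a naive byte-at-a-time fill for comparison (the helper name is mine, not from the test program above). Compiled at -O0 like the benchmark, a loop like this typically runs far below memory bandwidth, which is why the test leans on memset() and memchr() instead:

```c
#include <stddef.h>
#include <stdint.h>

// Naive one-byte-per-iteration fill. At -O0 each iteration is a separate
// load/store; glibc's memset() uses wide vector stores instead.
static void naive_fill(uint8_t *p, uint8_t v, size_t n)
{
  for (size_t i = 0; i < n; i++)
    p[i] = v;
}
```

Swapping naive_fill() in for memset() in the benchmark makes the gap obvious, though at higher optimization levels the compiler may recognize the loop and replace it with memset() anyway.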

With 16GB + 8GB of DDR4-3200 installed in memory channels A and B on my system, this prints:

Wrote 8.0 GB in 0.337975 seconds -> 23.67 GB/s
Read 8.0 GB in 0.249023 seconds -> 32.13 GB/s

The maximum possible bandwidth per channel is something like 3200e6 transfers/s × 8 bytes = 25.6e9 bytes/s, or about 23.8 GB/s in the 2^30-byte units used above. It turns out that on a Ryzen 5700 processor, writes are bottlenecked somewhere between the CPU and the RAM controller, so we don't get the benefit of dual-channel writes. We'd expect the reads to be up to twice as fast, but they're not. In fact, if I run the test on a 1-GB block instead of 8, reads and writes are the same speed. Presumably I'm getting memory that's mapped to a single DIMM in that scenario. I only get higher speeds for larger allocations that are more likely to use both DIMMs. At 16 GB, reads reach almost 33 GB/s.

So in conclusion, mismatched DIMM capacities in dual-channel mode on a Ryzen 5700 give some benefit, but read speeds don't reliably double.

By the way, benchmarks of this sort keep getting harder to write. The OS does not actually map pages to physical RAM until you write to them. If you try to use calloc() to initialize to zero, the OS maps all the pages to a read-only zero-filled page and still only maps them to physical RAM when you write. If you use malloc() followed by bzero() instead of calloc(), gcc's optimizer can replace that with a call to calloc()! And calls to memchr() are optimized out even at -O0 if you don't use the result somewhere. On the other hand, hand-written assembly is not reliably fast on today's CPUs. So be careful doing these kinds of tests with a modern compiler.
